Intro to representing data0. Learning objectives
We are going to shift gears a bit. We'll start talking about how computers
What do think you should get if you evaluate these expressions?
|
![]() |
1. How does a computer work?
|
At some point, all the statements and data that we have been working with in |
|
Remember that each transistor can take on two possible states. Let's consider the case of two transistors and we'll call these transistor A and transistor B.
Suppose transistor A is ON. Then transistor B can be either ON or OFF. So there are two possibilities there.
Now suppose transistor A is OFF. The transistor B can still be either ON or OFF.Another two possibilities!
This means there are a total of 4 possible states when we have two transistors. This generalizes to $s = 2^n$ where $s$ is the number of possible states, and $n$ is the number transistors. So with three transistors, there are 8 possible states.
So our fundamental unit for storing information is something that is either 0 or 1. We call this a BInary digiT, or a bit for short.
Let's explore an example of all unique combination that we can have if we use 4 bits.
| bit 1 | bit 2 | bit 3 | bit 4 |
Each set of bits (row) represents a unique combination and we can count these combinations (right column). The larger the number of bits that we use, the more combinations we can have. Just like we said the number of possible combinations with $n$ bits is equal to $2^n$. So, what if we decide to associate an integer to each combination like we have done in the right column? Then we could represent any number from $0$ to $2^n - 1$ with $n$ bits.
There are some common sizes of bits that are usually used to represent numbers: a nibble consists of 4 bits, a byte consists of 8 bits and a word consists of 32 bits (though 64 bits is becoming more common). What is the largest positive integer you can represent with 64 bits?
2. Representing information: positive integers
If you look at the table above, we decided that 10012 (the subscript 2 is to clarify
that this number is written in base-2 - more about it below - and distinguish it from
one-thousand-and-one) is a representation of number 910, and
11112 is a binary representation for the number 1510. But how do
we go from the 0s and 1s to a number we're familiar with? We'll look at this shortly, but first we need to
understand the different ways numbers can be represented. In fact, the numbers 910 and
1510 are represented in the "usual" decimal system we are familiar with, i.e. we use
base-10.
Let's explore some common ways of representing numbers, and then we'll look at how we can convert between one number representation and another.
The main concept to remember when we look at different representations is the idea of the base, which represents how many options we have for each digit. You're definitely familiar with the base-10 numbering system (any idea why 10 is so popular?). That is, the numbering system in which we have 10 digits (0-9) to represent a number. What numbers can you represent with two base-10 digits? Well we can represent the numbers 010-9910, or up to $10^2 - 1$. With $3$ base-10 digits? We can represent the numbers 010-99910, or $10^3-1$. Hmm, this sounds familiar. Therefore, with a base $b$ numbering system and $n$ digits, we can represent numbers between 0 and $b^n-1$.
2.1. Binary numbers (base-2)
The goal we have is to represent different types of numbers as a sequence of 1s and 0s. For example,
11012 is a number represented in binary, and 10011110101101102
is also a binary number. We can also forget about leading 0s: in other words,
11012 is the same number as 00000000000011012.
To convert from a binary (base-2) number to a base-10 number, you should read the bits from right to left. Since we work with positional numbering system, where the position of a digit represents how "significant" that digit is within the number, we can write a number as:
$$d_{3}d_{2}d_{1}d_{0} = d_{3}\cdot b^{3}+d_{2}\cdot b^{2}+d_{1}\cdot b^{1}+d_{0}\cdot b^{0}$$
Where the $d$ are the digits, $b$ is the base, and the subscripts represent the location. This means that 5183 can be thought as:
$$5183=5\cdot10^{3}+1\cdot10^{2}+8\cdot10^{1}+3\cdot10^{0}$$
If we apply this idea to binary numbers, then the rightmost bits represent the lowest powers of two
(starting with $2^0$ at the right). For example, let's look at the number 11012:
$$ 1101_{2} = 1\cdot2^{3}+1\cdot2^{2}+0\cdot2^{1}+1\cdot2^{0} = 8 + 4 + 1 = 13.$$
This mean that we can use this process to find a "natural" mapping between integers in base 10 and binary
numbers. So, the binary number 11012 represents the number
1310 in the computer! Perfect since it is only 1s and 0s.
Let's do another example: try to see what decimal integer is represented by the 16-bit number
110010012
This is a long one. Remember, we start counting positions from the very right (position zero) and each of the digit counts as current base (in this case 2) to the power of that position. So, the rightmost digit is worth $2^{0}$, the next one to its left $2^{1}$, the next one over $2^{2}$ and so on:
$$ 1\cdot2^{7} + 1\cdot2^{6} + 0\cdot2^{5} + 0\cdot2^{4} + 1\cdot2^{3} + 0\cdot2^{2} + 0\cdot2^{1} + 1\cdot2^{0} = 128 + 64 + 8 + 1 = 201. $$
Great, right? You can convert any integer number from base-2 (binary) to base-10 (decimal). Here is a few more for you to check out:
11012
100100012
111000002
111111001102
11012 => 1310
100100012 => 14510
111000002 => 22410
111111001102 => 202210
So we can convert numbers from binary to decimal. But how do we convert decimal to binary?
Here is one way"
- Find the highest power of 2 that is less than or equal to $x$. For example, for the number 13, the highest power is $3$ (i.e. $2^3 = 8$ is the highest power of two less than 9 since the next power of two, $2^4$ is equal to 16). Save the exponent of the highest power of 2 and call it $p$ (in this example $p = 3$). Put a 1 at that position (3) in the binary number that you are building: 1xxx2 (for our example with 13).
- Subtract this highest power of 2 from $x$, i.e. compute the difference $y = x - 2^p$. In our case $y=13-2^{3}=5$
- Go back to step 1 and, this time, use $y$ as starting number, i.e. find the highest power of $2$ that is less than or equal to $y$. In this case 2 since $2^{2}=4$ and $2^{3}=8$ and $y=5$. Mark a 1 at position 2 in the binary number you are building (11xx2). If you have positions in your binary number that you have not filled between this round and the previous round, then you should fill them up with 0s. (Try, for example, to follow this process with the number 910
- Keep going until $y$ is either 0 or 1, at which point you can write that in position 0, since $2^{0}=1$.
Sounds recursive, doesn't it? Maybe you can challenge yourself to write a function that does this conversion.
Using the steps above, convert 23310 into binary.
111010012.
It's always a good idea to check your result by using our previous algorithm for converting from binary back
to decimal, i.e. computing $ 1\cdot2^{7} + 1\cdot2^{6} + 1\cdot2^{5} + 1\cdot2^{3} + 1\cdot2^{1} = 128 + 64
+ 32 + 8 + 1 = 233_{2}$. 2.2. Hexadecimal numbers (base-16)
In a hexadecimal system, each digit can hold one of 16 values. So, what can we use for digits? We can start with our friends 0-9 to cover the first 10, then we need something else for the following 6. One of the most convenient choices is letters since they have a sort of natural ordering. So we are going to use A-F to represent the digits with value 10-15. This might sound weird, but it's very convenient when handling larger binary numbers. Remember that a cluster of 4 bits (a nibble) can represent a number between 0-15 which conveniently map to hexadecimal numbers between 0-F. So, in a large binary number, you can read off clusters of 4 bits starting from the right, convert each individual cluster into an hexadecimal digit, and then, if you need to, convert the hexadecimal number into a decimal to figure out what original binary number represented.
Using 4 bits, the conversion from binary (base-2), decimal (base-10) and hexadecimal (base-16) systems is summarized in the following table.
| bit 1 | bit 2 | bit 3 | bit 4 | decimal | hexadecimal |
Converting to/from a hexadecimal system follows the same algorithms that we described above to convert to/from a binary system. The only difference is that we work with powers of 16 instead of powers of 2.
For example, 2A16 = $2 \cdot 16^1 + 10 \cdot 16^0 = 42_{10}$ (remember that A16 represents 1010).
Similarly to what we did for binary, converting the decimal number 98110 to hexadecimal requires finding the highest power of 16 that is less than 98110, which in this case is $16^2 = 256$. The trick, now, is to find how many 25610 fit in 98110. In binary, we only had to consider two digits either 0 or 1. In hexadecimal we might be able to fit up to 15 of our powers into the number! In this example, we can fit $3\cdot16^2$ into 98110, since $3 \times 16^2 = 768_{10}$. So, the digit that we want to write down at position 2 is 3: 3xx16. Next, we compute 98110-76810 = 21310. The next highest power of 16 is $16^1$, and we can fit 13 of those into 21310, since $13\cdot16^1 = 208_{10}$. Now, the hexadecimal digit that corresponds to 1310 is D16. So we write that at position 1: 3Dx16. We then have 5 left over, so the $16^0$ slot is 5. Therefore, 98110 can be represented as 3D516 in hexadecimal.
Alternatively, you could convert the decimal number to binary, and then read off chunks of 4 to convert to hexadecimal. Following the procedure from the previous section gives
$$ 981_{10} = \underbrace{\texttt{0011}}_{3_{10}}\underbrace{\texttt{1101}}_{13_{10}} \underbrace{\texttt{0101}_{2}}_{5_{10}} $$which is again 3D516. Note that you'll need to write out the above table (or remember it) to do the conversion.
2.3. Positive integer representation summary
In every base, the representation of a number is positional just like the decimal representation is.
The following table shows the two representations for the first 7 positions.
| Digit position | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Positional representation in an arbitrary base b | b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0 |
| Representation for b = 10 |
107 | 106 | 105 | 104 | 103 | 102 | 101 | 100 |
| Values for b = 10 |
10000000 | 1000000 | 100000 | 10000 | 1000 | 100 | 10 | 1 |
| Representation for b = 2 |
27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 |
| Values for b = 2 |
128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
| Representation for b = 16 |
167 |
166 |
165 |
164 | 163 | 162 | 161 | 160 |
| Values for b = 16 |
268435456 | 16777216 | 1048576 | 65536 | 4096 | 256 | 16 | 1 |
Ex: 1034 = (value in decimal - base 10 - representation)
1*1000 + 0*100 + 3*10 + 4*1 = (in decimal representation)
1*103 + 0*102 + 3*101 + 4*100 =
1034 (value in decimal representation)
1011 = (value in binary - base 2 - representation)
1*8 + 0*4 + 1*2 + 1*1 =
1*23 + 0*22 + 3*21 + 4*20 =
11 (value in decimal representation)
2A = (value in hexadecimal - base 16 - representation)
2*16 + A*1 =
2*161+ A*160 =
42 (value in decimal representation)
In bases larger that 10 - for example the hexadecimal (b = 16) - we need to encode digit values larger than 9 with unique symbols. The usual choice is to use letters since they have an underlying sorting (A < B < C < ...). For hexadecimal, the mapping is in the following table:
| Digit in hexadecimal | Decimal representation | Binary representation |
| 0 | 0 | 0000 |
| 1 | 1 | 0001 |
| 2 | 2 | 0010 |
| 3 | 3 | 0011 |
| 4 | 4 | 0100 |
| 5 | 5 | 0101 |
| 6 | 6 | 0110 |
| 7 | 7 | 0111 |
| 8 | 8 | 1000 |
| 9 | 9 | 1001 |
| A | 10 | 1010 |
| B | 11 | 1011 |
| C | 12 | 1100 |
| D | 13 | 1101 |
| E | 14 | 1110 |
| F | 15 | 1111 |
So, for each digit of hexadecimal (hex), you have 4 uniquely corresponding digits of binary. This makes the conversion from binary to hex (and vice-versa) straightforward.
Ex: ABCD = 1010101111001101
104F = 0001000001001111
2.4 Doing math with binary numbers
We can actually do arithmetic with binary numbers the same way we do arithmetic with decimal numbers. The main
difference is the "carry" principle. When doing arithmetic with decimal numbers, you "carry a one" if your
current digit is greater than 9. In binary arithmetic, we "carry a one" if the current digit is greater than 1.
For example if we want to add the two binary numbers 1112 + 1102: $$
\begin{array}{cccc} \textcolor{red}{1} & \textcolor{red}{1} & & \\ & 1 & 1 & 1 \\ + & 1 & 1 & 0 \\ \hline
\textcolor{blue}{1} & \textcolor{blue}{1} & \textcolor{blue}{0} & \textcolor{blue}{1} \end{array} $$ which gives
11012 which is equal to 1310. Note that 1112 is
710 and 1102 is 610, so $7 + 6 = 13$). As an exercise, try to add
11012 with 00112.
11012 represents 1310, and 00112 represents
310, so we whould get 1610: $$ \begin{array}{ccccc} \textcolor{red}{1} &
\textcolor{red}{1} & \textcolor{red}{1} & \textcolor{red}{1} & \\ & 1 & 1 & 0 & 1 \\ + & 0 & 0 & 1 & 1 \\
\hline \textcolor{blue}{1} & \textcolor{blue}{0} & \textcolor{blue}{0} & \textcolor{blue}{0} &
\textcolor{blue}{0} \end{array} $$ which is 1610!
What about multiplication? Multiplication follows the same idea as in decimal arithmetic. Actually,
multiplication in binary arithmetic is nicer because the maximum you can get with any multiplication of a 0 and
1 is 1, so you never need to "carry" anything in the multiplication phase. You only need to carry a 1 during the
addition phase. For example, multiplying 102 (210) with
112 (310) should give 610:
For more practice, try to multiply 11012 with
10012.
11012 represents 1310, and 10012 represents
910, so we whould get 11710: $$ \begin{array}{ccccccc} & & & & 1 & 1 & 0 & 1 \\ & & &
\times & 1 & 0 & 0 & 1 \\ \hline & & & & 1 & 1 & 0 & 1 \\ & & & 0 & 0 & 0 & 0 & \\ & & 0 & 0 & 0 & 0 & & \\
+ & 1 & 1 & 0 & 1 & & & \\ \hline & 1 & 1 & 1 & 0 & 1 & 0 & 1 & \end{array} $$ which is 11710!
3. Representing information: real numbers
The systems we have looked at so far are good for representing integers. But what about real numbers?! For example, how would you represent the number $0.5$? Well, let's think about how we can write 0.5 if we think about our base-10 representation: $0.5 = 0 \cdot 10^{0} + 5 \cdot 10^{-1}$. This seems to indicate that if we have position from 0 up to the left of the decimal point, we have position from -1 down to its right. What if we extend our base-2 system to include negative positions as well? In our positional representation, they will result in negative powers of 2. This means that, if we agree on a spot in the middle of our binary number where the "decimal point" should go, 0.5 could be represented with a single bit to the right of this location that represents $2^{-1}$:
$$ 0.5_{10} \rightarrow 0.1_{2} = 0 \cdot 2^0 + 1 \cdot 2^{-1} $$
The same idea can be used to represent numbers like $0.25_{10}$, which is $0.01_{2} = 0 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2}$, and $0.75_{10}$ which is $0.11_{2} = 0 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2}$. But now what about 3.2510? 6.510? 1310? The issue we encounter is that all of these numbers have the same pattern of 0s and 1s:
$$\begin{eqnarray} 3.25_{10} &= 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} &= 11.01_{2} \\ 6.5_{10} &= 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0 + 1 \times 2^{-1} &= 110.1_{2} \\ 13_{10} &= 1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 &= 1101_{2} \end{eqnarray} $$
I added a decimal point in the examples above, which highlights that somehow we'll need to keep track of where the decimal point should go, with the understanding that, in binary representation, there is actually NO decimal point and what we do is somehow subdivide the bits in the binary number and decide which one represent the integer part and which one are used to represent the decimal part of the original real number. On top of that, we also need a way to keep track of the sign of the number. Is it positive or negative? One way to do this is to use a bit to identify the sign: 0 if positive and 1 if negative.
All these details are captured in a standard way that is used to represent real numbers in computers. This is called the floating-point numbers notation and it has a format similar to the scientific notation:
$$ -1^s \times F \times 2^e $$
where the first bit to the left is used to store the sign of the number ($s$), a certain number of bits for the mantissa $F$, and another bunch of bits for the exponent $e$. This representation is like a base-2 form of scientific notation.
We are not going into the details about how real numbers convert back and forth from this notation, you will
see a little more about this in computer architecture. What we are going to do here, is to get a feeling about
why certain numbers cannot be exactly represented using this notation, giving us weird answers like
3 * 0.1 -> 0.30000000000000004.
Remember that we represented integers with a fixed number of bits, so there was some largest integer that we could represent. For example we could represent the numbers 0-255 with 8 bits, but cannot represent 260 unless we use an extra bit. The floating-point representation can only use a fixed number of bits as well that we can use to store $F$ and $e$. However, since we are restricted to use powers of 2, not every number can be exactly represented. For example, the number $0.00675$ is representable exactly because it is $2^{-4}$ and $0.125$ is representable exactly because it is $2^{-3}$. Turns out that $0.1$ is not representable exactly! Let's try converting $0.1_{10}$ into a binary representation using an extension of the process we used above.
Starting with $0.1_{10}$, the highest power of 2 bigger than $0.1$ is $-4$ ($2^{-4} = 0.0675$), so in binary we have (with the decimal point only placed for clarity): $0.0001\dots_{2}$ where we still need to figure out what comes after position -4. Following our conversion procedure, we subtract: $0.1 - 2^{-4} = 0.0375$. Now we ask ourselves: what is the highest power of two that is less than $0.0375$? It is $2^{-5} = 0.03125$. So we can extend our binary representation: $0.00011\dots_{2}$. We now subtract $0.0375 - 2^{-5} = 0.00625$. If we keep going, we find a $1$ in the $2^{-8}$ slot, with a remainder of $0.0023475$, giving us $0.00011001\dots_{2}$ We then have a $1$ in the $2^{-9}$ slot, with a remainder of $0.000394375$, givin us $0.000110011\dots_{2}$. If you continue with this process, you might notice that the next digit is a 0, then another 0, then a 1, then another 1, then a 0.... and so on. It turns out that the representation of 0.110 in binary is periodic (just like 1/3 in decimal)!!! But we said that we only have a limited number of digits that we can use in our binary representation so, at some point, we will have to truncate our periodic binary number. This is what leads to loss of precision in our computations!
4. Representing information: other stuff
In order to represent floating-point numbers (as in the previous section), we had to come up with a standard (a convention we all agree on) for representing these numbers using bits. In fact, were using the IEEE 754 standard. We need these standards because, otherwise, a number might be represented in a certain way on computer A, but represented differently on computer B, which would be a big problem when trying to exchange information between those computers. There are many different standards that were developed over the years to make sure everybody builds computer that can easily exchange data (using 1s and 0s since that's all computers deal with).
Let's see an example considering characters. Imagine you sent a text message to your friend, where the characters on your phone are represented in a different way than they are on your friend's phone. This might cause some pretty bad communication problems! In fact, each character on your keyboard is associated with a specific sequence of bits. This is known as the American Standard Code for Information Interchange (ASCII), which is a standard for encoding characters. See the table below to the left. On the right we also show the table where, instead of the binary number associated with each character, we show the corresponding decimal and hexadecimal representation (You can verify that the binary values correspond to the decimal and hexadecimal values for a given character as an exercise!).
|
|
But what about emojis?? As you can see, the ASCII character set does not include emojis. Furthermore, ASCII is a very anglocentric encoding. What about all the beautiful languages in the word that do not use the Latin alphabet? In order to be able to include all these characters (and emojis) in your text message to your friend, then your message has to be encoded in a sequence of 1s and 0s and to do that, we need a mapping from each one of the characters and emojis into a sequence of bits. This is accomplished by the Unicode Standard (UTF-8).
Now, we could spend days and days here to talk about encoding (there are entire fields of engineering and computer science that study the best way to encode information), but the main message here is that in order to share information with other people, we need agree upon some encoding standard.
The same principle applies for all sort of other things that you use on your computer every day: sounds, images, videos, etc... Once we get familiar with few more data structures in Python, we are going to revisit this idea and explore how some of these things can be encoded and manipulated.
