Automatic Differentiation
Problem 1
Consider the function
\[ \begin{aligned} L(\mathbf{w}) = (y - \sigma(w_1 x + w_0))^2 + \lambda(w_1^2 + w_0^2)\;, \end{aligned} \]
where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is our friend the logistic sigmoid function. This is the loss function for a simple regression problem in which we aim to predict \(y\) using the function \(f(x) = \sigma(w_1 x + w_0)\). We’ve also included a regularization term.
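For concreteness, the loss can be evaluated in a few lines of Python (here `lam` stands in for \(\lambda\); the function name and argument order are our own choices):

```python
import math

def loss(w0, w1, x, y, lam):
    """Evaluate L(w) = (y - sigma(w1*x + w0))^2 + lam*(w1^2 + w0^2)."""
    z = w1 * x + w0                      # linear model input
    s = 1.0 / (1.0 + math.exp(-z))      # logistic sigmoid
    return (y - s) ** 2 + lam * (w1 ** 2 + w0 ** 2)
```

For example, with all arguments zero we get \(\sigma(0) = 1/2\), so the loss is \((0 - 1/2)^2 = 1/4\).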
Part A
Compute the derivatives \(\frac{\partial L}{\partial w_0}\) and \(\frac{\partial L}{\partial w_1}\) by hand, using the chain rule.
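One way to sanity-check your hand-derived derivatives, without giving anything away, is to compare them against central finite differences. A minimal sketch (step size `h` and the helper names are our own choices):

```python
import math

def loss(w0, w1, x, y, lam):
    z = w1 * x + w0
    s = 1.0 / (1.0 + math.exp(-z))
    return (y - s) ** 2 + lam * (w1 ** 2 + w0 ** 2)

def numeric_grad(w0, w1, x, y, lam, h=1e-6):
    """Central finite differences: approximate dL/dw0 and dL/dw1."""
    dw0 = (loss(w0 + h, w1, x, y, lam) - loss(w0 - h, w1, x, y, lam)) / (2 * h)
    dw1 = (loss(w0, w1 + h, x, y, lam) - loss(w0, w1 - h, x, y, lam)) / (2 * h)
    return dw0, dw1
```

If your chain-rule expressions disagree with these numbers by more than a few decimal places, recheck the derivation.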
Part B
Let’s now study the computational cost of computing these derivatives. Assume that multiplication, division, addition, subtraction, and exponentiation all cost 1 computational unit. So, for example, computing \(\sigma(z) = \frac{1}{1 + e^{-z}}\) costs 3 units: one for the exponentiation, one for the addition, and one for the division.
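If you want to check your unit counts empirically, one option is a tiny tally object that wraps each arithmetic operation. This is an illustrative helper of our own design, not part of the problem; note that, to match the cost model above (where \(\sigma\) costs exactly 3 units), the negation in \(e^{-z}\) is folded into the exponentiation rather than counted separately:

```python
import math

class CountedOps:
    """Tallies arithmetic operations so unit counts can be checked by running code."""
    def __init__(self):
        self.units = 0
    def add(self, a, b): self.units += 1; return a + b
    def sub(self, a, b): self.units += 1; return a - b
    def mul(self, a, b): self.units += 1; return a * b
    def div(self, a, b): self.units += 1; return a / b
    def exp(self, a):    self.units += 1; return math.exp(a)

def sigma(ops, z):
    # sigma(z) = 1 / (1 + e^{-z}): one exp, one add, one div = 3 units
    return ops.div(1.0, ops.add(1.0, ops.exp(-z)))
```

Running `sigma` once through a fresh `CountedOps` instance should report exactly 3 units, matching the example in the text.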
How many computational units does it cost to compute \(\frac{\partial L}{\partial w_0}\) and \(\frac{\partial L}{\partial w_1}\)? Please approach this question assuming that you are not allowed to store any intermediate values. So, for example, if one of your derivatives has the term \(\sigma(w_1 x + w_0)\) appearing multiple times, you need to compute it from scratch each time.
Part C
Now determine the number of computational units required to compute \(\frac{\partial L}{\partial w_0}\) and \(\frac{\partial L}{\partial w_1}\) if you are allowed to store intermediate values. For example, you might compute \(\sigma(w_1 x + w_0)\) once and store it in a variable, and then reuse that variable whenever you need to refer to \(\sigma(w_1 x + w_0)\) in your derivative. Your aim here is to minimize the number of computational units required; you can assume that you have no bound on your available storage.
Please state both the number of computational units required and the number of intermediate values you need to store.
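The bookkeeping in Part C amounts to a forward pass in which each intermediate value is computed once, given a name, and then reused. A minimal sketch of this idea (the particular intermediates stored here are one reasonable choice, not necessarily the optimal set for your answer):

```python
import math

def forward_with_cache(w0, w1, x, y, lam):
    """Forward pass that stores each intermediate once, as in Part C."""
    cache = {}
    cache["z"] = w1 * x + w0                           # 2 units: mul, add
    cache["s"] = 1.0 / (1.0 + math.exp(-cache["z"]))   # 3 units: exp, add, div
    cache["r"] = y - cache["s"]                        # 1 unit: sub
    cache["L"] = cache["r"] ** 2 + lam * (w1 ** 2 + w0 ** 2)
    return cache
```

Anything in `cache` can then appear in the derivative expressions at zero additional cost, which is exactly the savings this part asks you to quantify.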