Advanced Optimization: Newton’s Method and Adam
2025-04-11
The expectations for all blog posts apply!
Introduction
This is an advanced blog post that requires you to have previously completed our blog post on implementing logistic regression. If you haven’t implemented logistic regression yet, go do that first.
Part A: Newton’s Method
Newton’s Method is a second-order optimization technique. This means that it requires information about the second derivatives of the loss function \(L\) as well as the first derivatives. Here’s how Newton’s method works:
- We compute the usual gradient \(\nabla L(\mathbf{w})\). Recall that the gradient is the vector of first derivatives of \(L\).
- We also compute the Hessian matrix, which is the matrix of second derivatives of \(L\). For logistic regression, the Hessian is the matrix \(\mathbf{H}(\mathbf{w}) \in \mathbb{R}^{p \times p}\) with entries \[ \begin{aligned} h_{ij}(\mathbf{w}) = \sum_{k = 1}^n x_{ki}x_{kj}\sigma(s_k)(1-\sigma(s_k))\;. \end{aligned} \tag{1}\] Here, \(s_k = \langle \mathbf{w}, \mathbf{x}_k \rangle\) is the score of the \(k\)th data point and \(\sigma\) is the sigmoid function, as in our logistic regression blog post.
Once we know how to calculate the gradient and the Hessian, we repeat the update \[ \begin{aligned} \mathbf{w} \gets \mathbf{w} - \alpha \mathbf{H}(\mathbf{w})^{-1} \nabla L (\mathbf{w})\;. \end{aligned} \] until convergence. Here, \(\alpha > 0\) is a learning rate and \(\mathbf{H}(\mathbf{w})^{-1}\) is the matrix inverse of the Hessian matrix.
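To make the update concrete, here is a minimal sketch of a single Newton step in NumPy. The `gradient` and `hessian` callables are placeholders for your own implementations (they are not defined here), and the sketch solves the linear system \(\mathbf{H}\mathbf{d} = \nabla L(\mathbf{w})\) rather than forming the inverse explicitly, which is cheaper and numerically more stable.

```python
import numpy as np

def newton_step(w, X, y, alpha, gradient, hessian):
    """One Newton update: w <- w - alpha * H(w)^{-1} grad L(w).

    `gradient(w, X, y)` and `hessian(w, X)` are assumed to be callables
    supplied by you (e.g. methods of your LogisticRegression model).
    """
    g = gradient(w, X, y)        # gradient vector, shape (p,)
    H = hessian(w, X)            # Hessian matrix, shape (p, p)
    d = np.linalg.solve(H, g)    # solve H d = g instead of inverting H
    return w - alpha * d
```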
A.1: Implement Newton’s Method
Please implement a `NewtonOptimizer` class that can use Newton's method to estimate \(\mathbf{w}\) for a `LogisticRegression` model as implemented in our logistic regression blog post. To do so, you'll also need to extend your `LogisticRegression` implementation with a new method `hessian` that computes \(\mathbf{H}(\mathbf{w})\). Part of your task is to convert Equation 1 into a matrix-vector formulation that allows you to compute the Hessian using matrix multiplication, avoiding `for`-loops.
To check your implementation, make sure that, on the same data set, for sufficiently small \(\alpha\), Newton’s method converges to the same result that you would get using standard gradient descent.
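If it helps, here is a rough sketch of what such a check might look like; every name in it (the `logistic` and `data` modules, `LogisticRegression`, `GradientDescentOptimizer`, `NewtonOptimizer`, the `.step()` signature, and the `.w` attribute) is an assumption about your interface from the earlier post, not part of this assignment.

```python
import numpy as np

# hypothetical imports: adapt the module and class names to your own code
from logistic import LogisticRegression, GradientDescentOptimizer, NewtonOptimizer
from data import X, y  # placeholder for your chosen data set

def train(model, optimizer, alpha, n_iters):
    """Repeatedly call the optimizer's step method and return the learned weights."""
    for _ in range(n_iters):
        optimizer.step(X, y, alpha=alpha)
    return model.w

LR_gd, LR_newton = LogisticRegression(), LogisticRegression()
w_gd = train(LR_gd, GradientDescentOptimizer(LR_gd), alpha=0.01, n_iters=100_000)
w_newton = train(LR_newton, NewtonOptimizer(LR_newton), alpha=0.01, n_iters=1_000)

# for sufficiently small alpha, both methods should find (nearly) the same minimizer
print(np.allclose(w_gd, w_newton, atol=1e-3))
```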
A.2: Perform Experiments
Find an empirical data set suitable for binary classification. You can use anything that is not packaged with `scikit-learn`; feel free to scour the internet or previous assignments for interesting data sets. Then, perform experiments on this data set and include visualizations (a plotting sketch follows the list below) to show that:
- When \(\alpha\) is chosen appropriately, Newton’s method converges to the correct choice of \(\mathbf{w}\).
- Under at least some circumstances, Newton’s method can converge much faster than the standard gradient descent you implemented in your previous blog post on logistic regression, in the sense of decreasing the empirical risk.
- If \(\alpha\) is too large, Newton’s method fails to converge at all.
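One possible way to present these results, assuming you record the empirical risk at each iteration during training (the `*_history` lists below are hypothetical placeholders for those records, and the \(\alpha\) values are arbitrary):

```python
import matplotlib.pyplot as plt

# hypothetical loss histories recorded during your training runs;
# each is a list of empirical-risk values, one per iteration
histories = {
    "gradient descent, alpha = 0.1": gd_history,
    "Newton, alpha = 0.5 (converges)": newton_history_small_alpha,
    "Newton, alpha = 10 (diverges)": newton_history_large_alpha,
}

fig, ax = plt.subplots()
for label, history in histories.items():
    ax.plot(history, label=label)
ax.set_xlabel("iteration")
ax.set_ylabel("empirical risk")
ax.legend()
plt.show()
```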
Part B: Adam
The Adam optimization algorithm is a mainstay of modern deep learning. Unlike Newton’s method, which is usually applied to the entire data set at once, Adam is a variation on stochastic gradient descent. Adam was introduced by Kingma and Ba (2015), in a paper which has been cited over 140,000 times.
B.1: Implement Adam for Logistic Regression
Expand your implementation of logistic regression to include an `AdamOptimizer` which uses the algorithm described in the paper. Provided that the other components of your logistic regression implementation are correct, you won't need to modify any component of the model other than the optimizer.
The authors of the paper frame their algorithm as minimizing a stochastic (random) objective function \(f\). This is a fancy mathematical way to talk about stochastic batch gradient descent, which refers to selecting a subset of data points and computing the gradient from them. When they talk about evaluating/differentiating the “stochastic function” \(f\), you can instead think of “evaluating/differentiating \(f\) on a subset of the data.” So, you can think of \(g_t\) in their algorithm as the gradient evaluated on a batch of the data. You can still follow the standard structure of stochastic gradient descent in which you loop through the entire data set in one epoch, reshuffle the batches, and do it again.
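To connect the paper's notation to code, here is a minimal sketch of a single Adam step on one batch gradient \(g_t\), following Algorithm 1 of Kingma and Ba (2015). The default parameter values are the ones suggested in the paper; the surrounding batching and shuffling loop is left to you.

```python
import numpy as np

def adam_update(w, g, m, v, t, alpha=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam step, given the batch gradient g at iteration t (t starts at 1).

    Returns the updated weights along with the updated first- and second-moment
    estimates m and v, which the caller must carry between steps.
    """
    m = beta_1 * m + (1 - beta_1) * g       # first-moment (mean) estimate
    v = beta_2 * v + (1 - beta_2) * g**2    # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta_1**t)             # bias correction
    v_hat = v / (1 - beta_2**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```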
Your optimizer implementation should allow the user to pass five arguments (a constructor sketch follows this list):
- `batch_size`, the batch size for computing gradients. This works in the same way as it did in our initial implementation of stochastic gradient descent, and you can use much of the same code.
- `alpha`, the step-size.
- `beta_1` and `beta_2`, the moment estimate decay rates.
- `w_0`, the initial guess for the weight vector. Personally, I would suggest giving this argument a default `None` value and, in the case that the user does not pass `w_0`, initializing it randomly with the correct dimensions.
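One possible constructor reflecting these five arguments is sketched below; the `model` argument, the default values, and the lazy random initialization are assumptions about how you might wire this into your own implementation, not requirements.

```python
import numpy as np

class AdamOptimizer:
    """Sketch of an optimizer exposing the five arguments described above."""

    def __init__(self, model, batch_size=32, alpha=0.001,
                 beta_1=0.9, beta_2=0.999, w_0=None):
        self.model = model
        self.batch_size = batch_size
        self.alpha = alpha
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.w_0 = w_0   # if None, initialize randomly once the data dimension is known
        self.t = 0       # step counter, used for bias correction

    def _initialize(self, p):
        """Create the weights and moment estimates once the feature dimension p is known."""
        if self.w_0 is None:
            self.w_0 = np.random.randn(p)   # random initialization with the correct dimensions
        self.model.w = self.w_0.copy()
        self.m = np.zeros(p)                # first-moment estimate
        self.v = np.zeros(p)                # second-moment estimate
```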
B.2: Test Adam
Test Adam on the same data set that you used for your test of Newton's method above. Compare Adam to standard minibatch stochastic gradient descent (using the same batch sizes) for a few different choices of the step-size.
Part C: Compare Newton’s Method and Adam
Run an experiment on your chosen data set in which you compare the speed of convergence of Adam and Newton’s method. Because Adam and Newton’s method perform very different computations, it is not intrinsically meaningful just to compare “the number of steps” for convergence. Instead, please measure the clock time (in seconds) required for each method to reach a specified loss (which you may choose) on the data.
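A hedged sketch of one way to do this timing, using Python's `time.perf_counter`; the `step_fn` and `loss_fn` callables are placeholders for your own optimizer step and empirical-risk computation.

```python
import time

def time_to_target(step_fn, loss_fn, target_loss, max_steps=100_000):
    """Return the wall-clock seconds needed for `loss_fn()` to drop below
    `target_loss` while repeatedly calling `step_fn`, or None if it never does.

    `step_fn` performs one optimizer update (a full Newton step or one Adam
    minibatch step); `loss_fn` returns the current empirical risk on the data.
    Both are hypothetical callables supplied from your implementation.
    """
    start = time.perf_counter()
    for _ in range(max_steps):
        step_fn()
        if loss_fn() < target_loss:
            return time.perf_counter() - start
    return None
```

You might then call something like `time_to_target(lambda: newton_opt.step(X, y, alpha=1.0), lambda: LR.loss(X, y), target_loss=0.3)` for each optimizer, where `newton_opt`, `LR.loss`, and the target value are again placeholders for your own code and choices.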
Part D: Discussion and Finishing Touches
- Add a link to your source code for your two optimizers on GitHub at the very top of your blog post.
- Add expository writing throughout your post, with special focus on carefully describing the purpose of your experiments and what the results mean. Please also make sure that all plots are appropriately labeled with axis labels and legends.
- Add an “abstract” paragraph in which you describe the overall topic and aims of your blog post.
- Add a concluding paragraph in which you reflect on what you achieved and what you learned.
© Phil Chodrow, 2025