Checking Dimensions in Deep Learning Models

Part A

Let \(\mathbf{x}\in \mathbb{R}^p\) be a vector with \(p\) input features (we’ll assume that the constant feature is included). We can represent a simple neural network with \(k\) outputs as a function \(f: \mathbb{R}^p \to \mathbb{R}^k\) that maps the input vector \(\mathbf{x}\) to an output vector \(\mathbf{y}= f(\mathbf{x})\) using the following formula:

\[ \begin{aligned} f(\mathbf{x}) = \alpha(\alpha(\mathbf{x}^T \mathbf{W}_1) \mathbf{W}_2)\mathbf{W}_3\;, \end{aligned} \]

where \(\alpha\) is a nonlinear function (like the ReLU) which is applied entrywise to its argument.

Each of the three matrices \(\mathbf{W}_i\) has dimensions \(r_i \times c_i\) for some positive integers \(r_i\) and \(c_i\). Please give an example of choices of \(r_i\) and \(c_i\) for \(i=1,2,3\) that would work for the above formula, in the sense that all matrix-vector operations are valid and that the input and output have correct dimensions.
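One way to test a proposed answer is to propagate shapes through the formula mechanically. The sketch below is only a checking aid, not part of the assignment: the helper names `matmul_shape` and `output_shape` are hypothetical, and it assumes \(\mathbf{x}^T\) is treated as a \(1 \times p\) row vector and that \(\alpha\), being entrywise, never changes a shape.

```python
def matmul_shape(left, right):
    """Return the shape of A @ B for 2-D shapes, or raise ValueError if they are incompatible."""
    (m, n), (n2, q) = left, right
    if n != n2:
        raise ValueError(f"cannot multiply {left} by {right}")
    return (m, q)


def output_shape(p, weight_shapes):
    """Shape of f(x) when x has p features and weight_shapes lists (r_i, c_i) in order."""
    shape = (1, p)  # x^T is a 1 x p row vector
    for w_shape in weight_shapes:
        # The entrywise nonlinearity alpha leaves the shape unchanged,
        # so only the matrix products matter here.
        shape = matmul_shape(shape, w_shape)
    return shape
```

Calling `output_shape(p, [(r1, c1), (r2, c2), (r3, c3)])` with your own values should return `(1, k)` if the choice works, and raise a `ValueError` if some product is undefined.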

Part B

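Applying a single convolutional kernel of size \(k \times k\) (with stride 1 and no padding) to an image with \(r\) rows, \(c\) columns, and \(d\) channels (e.g. RGB) results in an output with \(r - k + 1\) rows, \(c - k + 1\) columns, and \(1\) channel. The kernel spans all \(d\) input channels and its products are summed across them, which is why the output has a single channel.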
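The shape claim for a single kernel can be seen in a deliberately naive, pure-Python sketch. It assumes unit stride and no padding, stores the image as nested lists indexed `[row][column][channel]`, and computes the unflipped sliding product that deep learning libraries call convolution. The names `convolve_single_kernel` and `constant_array` are hypothetical helpers for illustration only.

```python
def convolve_single_kernel(image, kernel):
    """Slide one k x k x d kernel over an r x c x d image (stride 1, no padding).

    Returns a 2-D list with one number per kernel position, i.e. a single channel.
    """
    r, c = len(image), len(image[0])
    k = len(kernel)
    d = len(kernel[0][0])
    output = []
    for i in range(r - k + 1):          # valid vertical positions
        row = []
        for j in range(c - k + 1):      # valid horizontal positions
            total = 0.0
            for a in range(k):
                for b in range(k):
                    for ch in range(d):  # sum over all input channels
                        total += image[i + a][j + b][ch] * kernel[a][b][ch]
            row.append(total)
        output.append(row)
    return output


def constant_array(rows, cols, depth, value=1.0):
    """Build a rows x cols x depth nested list filled with a constant."""
    return [[[value] * depth for _ in range(cols)] for _ in range(rows)]


image = constant_array(6, 5, 3)    # r = 6, c = 5, d = 3
kernel = constant_array(2, 2, 3)   # k = 2
out = convolve_single_kernel(image, kernel)
print(len(out), len(out[0]))       # prints "5 4", i.e. r - k + 1 and c - k + 1
```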

Consider the following simple convolutional neural network:

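from torch import nn

class SmallConvNet(nn.Module):
    def __init__(self, d_channels, m_kernels, kernel_size, input_dimension, output_dimension):
        super().__init__()

        self.pipeline = nn.Sequential(
            # m_kernels kernels, each spanning all d_channels input channels
            nn.Conv2d(d_channels, m_kernels, kernel_size),
            nn.ReLU(),
            # collapse every non-batch dimension into a single vector
            nn.Flatten(),
            nn.Linear(input_dimension, output_dimension)
        )

    def forward(self, x):
        return self.pipeline(x)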

Here, d_channels is the number of channels in the input image (corresponding to \(d\) above), m_kernels is the number of convolutional kernels, kernel_size is the size of the convolutional kernel (corresponding to \(k\) above), input_dimension is the dimension of the flattened output of the convolutional layer, and output_dimension is the dimension of the output of the network.

Please give a formula for the required value of input_dimension in terms of d_channels, m_kernels, kernel_size, and the dimensions of the input image (i.e. \(r\) and \(c\)).

Hint: You may find it helpful to check your proposed formula against an example in the lecture notes.
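If no lecture-notes example is at hand, a proposed formula can also be checked empirically by pushing a dummy image through the convolution and flatten steps and reading off the resulting size. This is a minimal sketch, assuming PyTorch is installed and that `nn.Conv2d` is used with its default stride and padding, as in the network above. The function name `flattened_dimension` is hypothetical.

```python
import torch
from torch import nn


def flattened_dimension(d_channels, m_kernels, kernel_size, r, c):
    """Measure the size of the flattened convolutional output for one r x c image."""
    conv = nn.Conv2d(d_channels, m_kernels, kernel_size)
    flatten = nn.Flatten()
    # PyTorch expects input shaped (batch, channels, rows, columns).
    dummy = torch.zeros(1, d_channels, r, c)
    with torch.no_grad():
        out = flatten(conv(dummy))
    # After flattening, the shape is (batch, features); return the feature count.
    return out.shape[1]
```

Evaluating `flattened_dimension` for a few different inputs gives numbers to compare against your formula. The measured value is what `input_dimension` must equal for `nn.Linear` to accept the flattened tensor.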