Data Science Interview Question #1

Jun 26, 2023


I stumbled upon an interview question for a junior data scientist posed by the Twitter user Quantian (though some argue it isn’t actually a junior-level question!):

[Image: Quantian's tweet posing the interview question, with a scatter plot of the rotated data and the line y = x drawn through it.]

The question is:

Junior data scientist interview question: Assume you generate points X = N(0,1); Y = N(0,0.1). Rotate the (x,y) dataset 45 degrees, so they look something like pic below (line is y=x). If you were to calculate the OLS regression y = b1*x + b0, what is E[b1] as n -> infinity?

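Per the author, the Gaussian distributions are written in the form \(\mathcal{N}(\mu, \sigma)\), so the second parameter is the standard deviation rather than the variance: \(\sigma_x = 1\) and \(\sigma_y = 0.1\).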

It’s been a while since I’ve thought about interviews or tackled these kinds of exercises, so I’ve decided to work it out here! If all goes well, this will be the first in a series of posts on interview questions (from Twitter or elsewhere) on this blog.


Solution

We have a dataset drawn from a 2D normal distribution with a different variance in each dimension. There are two main parts to this question:

  • How does the rotation of the dataset affect the distribution?
  • How do we compute the regression coefficients from this, specifically the OLS slope?

Before rotation, the covariance matrix is:

$$C = \begin{pmatrix}1 & 0 \\ 0 & (0.1)^2 \end{pmatrix}$$

If we rotate the data 45 degrees, we must multiply our random variable vector by a 45-degree rotation matrix. Recall that the 2D rotation matrix for angle \(\theta\) is given by

$$ R(\theta) = \begin{pmatrix}\cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta)\end{pmatrix} $$

Setting \(\theta = \frac{\pi}{4}\) and applying the matrix to the vector \(\vec{X} = (X, Y)\), we get

$$ \vec{X}' = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix}  \begin{pmatrix} X \\ Y \end{pmatrix} $$
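As a quick sanity check on the rotation matrix, here is a minimal NumPy sketch. The helper name `rotation_matrix` is my own choice for illustration, not something from the original question. It confirms that a point on the x-axis lands on the line y = x after a 45-degree rotation.

```python
import numpy as np


def rotation_matrix(theta):
    """Return the 2D counterclockwise rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


R = rotation_matrix(np.pi / 4)

# A point on the x-axis should end up on the line y = x.
point = np.array([1.0, 0.0])
rotated = R @ point
print(rotated)  # both coordinates are 1/sqrt(2), up to floating-point error
```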

What are the mean and covariance matrix of this new distribution?

$$ \mathbb{E}(\vec{X}') = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix}\mathbb{E}(\vec{X}) = \vec{0}$$

$$\text{var}(\vec{X}') = \frac{1}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix} \text{var}(\vec{X})\begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix} ^{T} $$

$$= \frac{1}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix} \begin{pmatrix}1 & 0 \\ 0 & 0.01 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ -1 & 1\end{pmatrix} $$

Simplifying,

$$\text{var}(\vec{X}') = \frac{1}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix} \begin{pmatrix}1 & 1 \\ -0.01 & 0.01 \end{pmatrix} $$

$$= \frac{1}{2} \begin{pmatrix} 1.01 & 0.99 \\ 0.99 & 1.01\end{pmatrix} $$
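The matrix product is easy to get wrong by hand, so here is a small NumPy sketch that recomputes it numerically. The variable names are mine, and it assumes, as stated earlier, that 0.1 is a standard deviation.

```python
import numpy as np

sigma_x, sigma_y = 1.0, 0.1  # standard deviations from the question

# Covariance before rotation: independent X and Y.
C = np.diag([sigma_x**2, sigma_y**2])

# 45-degree rotation matrix.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Covariance of the rotated vector is R C R^T.
cov_rotated = R @ C @ R.T
print(cov_rotated)  # approximately [[0.505, 0.495], [0.495, 0.505]]
```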

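Armed with this new distribution, we can compute the OLS slope. As \(n \to \infty\), the sample variance and covariance converge to their population values, so the fitted slope converges to the population ratio:

$$b_1 = \frac{\text{cov}(X',Y')}{\text{var}(X')} = \frac{0.5  (0.99)}{0.5 (1.01)} \approx 0.98$$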
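To check this value empirically, here is a rough simulation sketch. The sample size and the seed are arbitrary choices of mine, and `np.polyfit` stands in for whatever OLS routine you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 200_000                     # sample size, chosen arbitrarily

# Draw X ~ N(0, 1) and Y ~ N(0, 0.1); the second parameter is a standard deviation.
x = rng.normal(0.0, 1.0, size=n)
y = rng.normal(0.0, 0.1, size=n)

# Rotate every point by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x_rot, y_rot = R @ np.vstack([x, y])

# Fit y' = b1 * x' + b0 by least squares; polyfit returns the slope first.
b1, b0 = np.polyfit(x_rot, y_rot, deg=1)

print(f"simulated slope:   {b1:.4f}")
print(f"theoretical slope: {0.99 / 1.01:.4f}")
```

With a sample this large, the simulated slope should agree with 0.99/1.01 to about two or three decimal places, and the intercept should be close to zero.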

Thus, the slope is close to, but not exactly, 1. If Y had a standard deviation of zero, the rotated points would all lie on the line y = x and the slope would be exactly 1. We can generalize the problem to any standard deviation \(0 \leq \sigma_y < 1\), with solution

$$b_1 = \frac{1 - \sigma_y^2}{1 + \sigma_y^2} $$
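For completeness, here is a sketch of where the general formula comes from. It repeats the covariance calculation above with \(0.01\) replaced by \(\sigma_y^2\), still assuming \(\sigma_x = 1\):

$$\text{var}(\vec{X}') = \frac{1}{2} \begin{pmatrix} 1 & -1 \\ 1 & 1\end{pmatrix} \begin{pmatrix}1 & 0 \\ 0 & \sigma_y^2 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ -1 & 1\end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 + \sigma_y^2 & 1 - \sigma_y^2 \\ 1 - \sigma_y^2 & 1 + \sigma_y^2\end{pmatrix} $$

Dividing the off-diagonal entry by the top-left entry gives the slope, and setting \(\sigma_y = 0.1\) recovers \(0.99 / 1.01 \approx 0.98\).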