Data Science Interview Question #2

Jul 04, 2023

The second interview question comes from data superstar and SQL extraordinaire rxycommar. He wrote up a blog post tying the first interview question from Twitter into a commentary on issues with the field of data science and the superficial understanding of statistics and linear algebra that plagues certain corners of it. His blog is full of insightful commentary on DS and DS-adjacent topics, and his emphasis on truly understanding the basics deeply resonates with me. I recommend following him! (Though if you’re here, it’s probably because you already know of him and found me while looking him up.)

The second question comes from a tweet he posted last year. It reads (slightly paraphrased, with extra information and corrections from the comments):

Let’s say you fit a univariate regression model:

$$y = \alpha + \beta x + \epsilon$$

And you get the following results using OLS: \(\alpha = 3\), \(\beta = 2\), and \(\epsilon \sim \mathcal{N}(\mu=0,\sigma=1)\).

Now imagine that your coworker runs the same regression, but on a version of the data where the left-hand side contains a measurement error \(u\) that is uncorrelated with the existing residual. (The \(x\) data is the same.)

$$y^{*} \equiv y + u$$

$$y^{*} = \alpha^{*} + \beta^{*}x + \epsilon^{*}$$

where \(u \sim \mathcal{N}(\mu=2,\sigma=2)\) and \(\epsilon^{*} \sim \mathcal{N}(\mu^{*},\sigma^{*})\).

What are \(\alpha^{*}\), \(\beta^{*}\), \(\mu^{*}\) and \(\sigma^{*}\) in your coworker’s regression (when fit using OLS)?

Intuitively, what we are doing here is shifting the data up 2 units while increasing the variance. The increase in variance will be reflected within \(\epsilon^{*}\), leading to an increase in \(\sigma^{*}\). But because \(\epsilon^{*}\) is a residual, \(\mu^{*}=0\) and the mean shift will instead be assigned to the intercept coefficient \(\alpha^{*}\). Lastly, \(\beta^{*}=\beta\) because none of the changes introduced depend on the value of \(x\). Below we will work out the math.
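Before the algebra, the claim that the intercept absorbs the mean shift can be checked numerically. The sketch below is only an illustration under assumptions: the question does not say how \(x\) is distributed, so a uniform draw on \([-5, 5]\) is assumed, and the sample size, seed, and the extra means tried for \(u\) (0 and 10 alongside the question's 2) are arbitrary choices. It fits OLS with an intercept via `numpy.linalg.lstsq` and reports the residual mean for each case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Assumed design: the question does not specify how x is distributed.
x = rng.uniform(-5, 5, size=n)
y = 3 + 2 * x + rng.normal(0, 1, size=n)

# Design matrix with an intercept column, so OLS estimates alpha and beta.
X = np.column_stack([np.ones(n), x])

results = {}
for u_mean in (0.0, 2.0, 10.0):
    # Measurement error with standard deviation 2 and the given mean.
    y_star = y + rng.normal(u_mean, 2, size=n)
    coef, *_ = np.linalg.lstsq(X, y_star, rcond=None)
    resid = y_star - X @ coef
    results[u_mean] = (coef[0], coef[1], resid.mean())
    print(f"mean(u)={u_mean:5.1f}  intercept={coef[0]:.3f}  "
          f"slope={coef[1]:.3f}  residual mean={resid.mean():.2e}")
```

In each case the residual mean should be zero up to floating-point error, because OLS with an intercept forces the residuals to sum to zero, while the fitted intercept moves by roughly the mean of \(u\).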

First, from the original problem, we can write down the conditional mean and variance of \(y\) given \(x\) using the OLS solution:

$$\mathbb{E}[y|x] = \alpha + \beta x = 3 + 2 x$$

$$\text{var}(y|x) = \sigma^2 = 1$$

And we can do the same for \(y^{*}\):

$$\mathbb{E}\left[y^{*}|x\right] = \alpha^{*} + \beta^{*} x + \mu^{*}$$

$$\text{var}\left(y^{*}|x\right) = (\sigma^{*})^2$$

Using the definition of \(y^{*}\),

$$\mathbb{E}\left[y^{*}|x\right] =\mathbb{E}\left[y|x\right] + \mathbb{E}[u] =  3 + 2x + 2 = \alpha^{*} + \beta^{*}x + \mu^{*}$$

Matching the coefficients on \(x\), we can see that \(\beta^{*}=2\). And because \(\epsilon^{*}\) is a residual in the modified regression problem, we must have \(\mu^{*}=0\). Thus \(\alpha^{*} = 5\). Looking now at the variances, and using the fact that \(u\) is uncorrelated with the existing residual so that the variances add,

$$\text{var}\left(y^{*}|x\right) = \text{var}\left(y|x\right) + \text{var}(u) = 1 + 2^2 = 5 = (\sigma^{*})^2$$

So \(\sigma^{*} = \sqrt{5}\). Thus,

$$y^{*} = 5 + 2x + \epsilon^{*}$$


with \(\epsilon^{*} \sim \mathcal{N}(\mu^{*}=0,\sigma^{*}=\sqrt{5})\).
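As a sanity check on the full result, here is a small simulation sketch. It is not part of the original question: the distribution of \(x\) (standard normal), the sample size, the seed, and the helper name `fit_ols` are all assumptions made for illustration. It generates data from the original model, adds the measurement error \(u\), fits both regressions, and estimates the residual standard deviation with the usual \(n - 2\) degrees-of-freedom correction.

```python
import numpy as np


def fit_ols(x, y):
    """Fit y = a + b*x by OLS (hypothetical helper).

    Returns (a, b, residual mean, residual standard deviation).
    """
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Two parameters are estimated, hence n - 2 degrees of freedom.
    sigma = np.sqrt(resid @ resid / (len(y) - 2))
    return coef[0], coef[1], resid.mean(), sigma


rng = np.random.default_rng(42)
n = 200_000

x = rng.normal(0, 1, size=n)       # assumed distribution for x
eps = rng.normal(0, 1, size=n)     # original residual, N(0, 1)
u = rng.normal(2, 2, size=n)       # measurement error, mean 2, sd 2

y = 3 + 2 * x + eps
y_star = y + u

alpha, beta, mu, sigma = fit_ols(x, y)
alpha_s, beta_s, mu_s, sigma_s = fit_ols(x, y_star)

print(f"original: alpha={alpha:.3f} beta={beta:.3f} mu={mu:.1e} sigma={sigma:.3f}")
print(f"with u:   alpha={alpha_s:.3f} beta={beta_s:.3f} mu={mu_s:.1e} sigma={sigma_s:.3f}")
print(f"sqrt(5) = {np.sqrt(5):.3f}")
```

With a sample this large, the second fit should land close to \(\alpha^{*} = 5\), \(\beta^{*} = 2\), \(\mu^{*} = 0\), and \(\sigma^{*} = \sqrt{5} \approx 2.236\), though the exact digits depend on the seed.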