Linear Regression

What makes linear regression linear is that we assume that the output truly can be expressed as a linear combination of the input features.

This seems different from what I’ve learned: linear regression is linear w.r.t. the parameters, not the inputs.

Linear regression does indeed assume that the output is a linear combination of the input features: the model applies no transformations to the input features other than multiplication by a scalar and addition.

Of course, you can change what you mean by ‘input features’ and apply some other (non-linear) transformations to the features first, but strictly speaking that is not part of linear regression. Saying that linear regression is linear w.r.t. the parameters is just a shorthand way of saying that only linear combinations of the inputs, with the parameters as coefficients, are allowed.

I have a question about regression using mxnet. I am trying to use it on an expression like this, where w0 and x0 are scalars:

auto stage1 = w0 * x0 * x0;
auto net = LinearRegressionOutput("linreg", stage1, Symbol::Variable("label"));

My problem is that it’s not converging, so I paste the whole code just in case.

#include "mxnet-cpp/MxNetCpp.h"

using namespace std;
using namespace mxnet::cpp;

int main(int argc, char** argv) {
  const float learning_rate = 0.01;

  vector<mx_float> input  = { /* training inputs elided in the original post */ };
  vector<mx_float> output = { /* training targets elided in the original post */ };

  Context ctx0 = Context::cpu();

  auto x0 = Symbol::Variable("x");
  auto w0 = Symbol::Variable("w0");

  auto stage1 = w0 * x0 * x0;
  auto net = LinearRegressionOutput("linreg", stage1, Symbol::Variable("label"));

  NDArray daInputs  = NDArray(input,  Shape(1, input.size()),  ctx0);
  NDArray daOutputs = NDArray(output, Shape(1, output.size()), ctx0);

  std::map<string, NDArray> args0;

  Optimizer* opt = OptimizerRegistry::Find("adam");
  opt->SetParam("lr", learning_rate); // ->SetParam("wd", weight_decay);

  int epoch = 100000;

  auto arg_names = net.ListArguments();

  for (int e = 0; e != epoch; e++) {
    for (size_t s = 0; s != input.size(); s++) {
      args0["x"]     = NDArray({input[s]},  Shape(1, 1), ctx0);
      args0["label"] = NDArray({output[s]}, Shape(1, 1), ctx0);
      auto exec0 = net.SimpleBind(ctx0, args0);

      exec0->Forward(true);  // forward pass is needed before gradients exist
      exec0->Backward();     // fill grad_arrays for the update below

      for (size_t i = 0; i != arg_names.size(); i++) {
        if (arg_names[i] == "w0")
          opt->Update(i, exec0->arg_arrays[i], exec0->grad_arrays[i]);
        cout << arg_names[i] << "=" << exec0->arg_arrays[i] << endl;
      }
      delete exec0;
    }
  }

  return 0;
}

@mli In the gradient descent formula (3.1.10), aren’t you always changing all the coefficients together by the same value? Shouldn’t this step be performed for each coefficient separately?

Hi @matanper ,

Note that the value of the gradient depends on the values of \mathbf{w} and b, so in every iteration the values of w and b are different and therefore the gradient is different too. I think it would be clearer written this way:

\begin{aligned} \mathbf{w^{t+1}} = \mathbf{w^{t}} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w^{t}}, b^{t}) = \mathbf{w^{t}} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w^{t}}^\top \mathbf{x}^{(i)} + b^{t} - y^{(i)}\right),\\ b^{t+1} = b^{t} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w^{t}}, b^{t}) = b^{t} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w^{t}}^\top \mathbf{x}^{(i)} + b^{t} - y^{(i)}\right). \end{aligned}

This explicitly expresses the iterative nature of w and b, so the values of the gradients change with them too (note that \mathbf{w^{t+1}} and b^{t+1} depend only on values at step t). Hopefully this makes it clearer for you.


Hi everybody,

Does anybody know the answer to the third question?

I don’t completely understand the part of the question about the problem with SGD under Laplacian noise. This is my try:

  1. This is actually the Laplace distribution rather than an exponential distribution, and it implies L1-norm minimization, so there is no closed-form solution.


P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}) = \prod_{i=1}^{n} \frac{1}{2} e^{-|\mathbf w^\top \mathbf x^{(i)} + b - y^{(i)}|} \\ \Rightarrow L(\mathbf w, b) = -\log P(\mathbf y \mid \mathbf X) = n \log 2 + \sum_{i=1}^{n} |\mathbf w^\top \mathbf x^{(i)} + b - y^{(i)}|

3.2 SGD for L1 (this is actually a subgradient, since the absolute value is not differentiable at zero; just assume that \mathrm{sgn}(0) = 0):

\begin{aligned} \mathbf{w^{t+1}} = \mathbf{w^{t}} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w^{t}}, b^{t}) = \mathbf{w^{t}} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \, \mathrm{sgn}\!\left({\mathbf{w}^{t}}^\top \mathbf{x}^{(i)} + b^{t} - y^{(i)}\right),\\ b^{t+1} = b^{t} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w^{t}}, b^{t}) = b^{t} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathrm{sgn}\!\left({\mathbf{w}^{t}}^\top \mathbf{x}^{(i)} + b^{t} - y^{(i)}\right). \end{aligned}

What can go wrong besides the same things as with the L2 loss (for example, too large a learning rate)?


3.1 I got the same results from my derivation. Since the first term is constant, minimizing L(w, b) reduces to L1 loss minimization.
3.3: Subgradient + update as defined by you. Coordinate descent can also be used here. If I remember correctly, glmnet uses coordinate descent to solve L1-regularized regression (equivalent to a Laplace prior), as it converges faster.

Does anyone have solutions for the 3.1 exercises? Thanks everyone for the replies.