http://d2l.ai/chapter_linearnetworks/linearregression.html
What makes linear regression linear is that we assume that the output truly can be expressed as a linear combination of the input features.
This seems different from what I’ve learned: linear regression is linear w.r.t. the parameters, not the inputs.
Linear regression does assume that the output is a linear combination of the input features. That is, the model applies no transformation to the input features other than multiplication by a scalar and addition.
Of course, you can change what you mean by ‘input features’ and apply some other (nonlinear) transformations to the features first, but strictly speaking that is not part of the linear regression. Saying that linear regression is linear w.r.t. the parameters is just shorthand for saying that only linear combinations of the inputs, with the parameters as coefficients, are allowed.
I have a question about regression using MXNet. I am trying to use it on an expression like this, where w0 and x0 should be scalars:
auto stage1 = w0 * x0 * x0;
auto net = LinearRegressionOutput("linreg", stage1, Symbol::Variable("label"));
My problem is that it’s not converging. I have pasted the whole code just in case.
#include <iostream>
#include "mxnetcpp/MxNetCpp.h"
using namespace std;
using namespace mxnet::cpp;
int main(int argc, char** argv)
{
    const float learning_rate = 0.01;
    vector<mx_float> input =
    {
        1.0,
        3.0,
        5.3,
        8.0,
        6
    };
    vector<mx_float> output =
    {
        3.1415,
        9.4245,
        16.64995,
        25.132,
        18.849
    };
    Context ctx0 = Context::cpu();
    auto x0 = Symbol::Variable("x");
    auto w0 = Symbol::Variable("w0");
    auto stage1 = w0 * x0 * x0;
    auto net = LinearRegressionOutput("linreg", stage1, Symbol::Variable("label"));
    NDArray daInputs = NDArray(input, Shape(1, input.size()), ctx0);
    NDArray daOutputs = NDArray(output, Shape(1, input.size()), ctx0);
    std::map<string, NDArray> args0;
    Optimizer* opt = OptimizerRegistry::Find("adam");
    opt->SetParam("lr", learning_rate); //->SetParam("wd", weight_decay);
    int epoch = 100000;
    auto arg_names = net.ListArguments();
    while (epoch--)
    {
        for (int s = 0; s != input.size(); s++)
        {
            args0["x"] = NDArray({input[s]}, Shape(1, 1), ctx0);
            args0["label"] = NDArray({output[s]}, Shape(1, 1), ctx0);
            auto exec0 = net.SimpleBind(ctx0, args0);
            exec0->Forward(true);
            exec0->Backward();
            for (int i = 0; i != arg_names.size(); i++)
            {
                if (arg_names[i] == "w0")
                    opt->Update(i, exec0->arg_arrays[i], exec0->grad_arrays[i]);
                cout << arg_names[i] << "=" << exec0->arg_arrays[i] << endl;
            }
            delete exec0;
        }
    }
    return 1;
}
@mli In the gradient descent formula (3.1.10), aren’t you always changing all the coefficients together by the same value? Shouldn’t this step be performed for each coefficient separately?
Hi @matanper ,
Note that the value of the gradient depends on the current values of $\mathbf{w}$ and $b$. These change at every iteration, so the gradient changes too. I think it is clearer written this way:
$$\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}^{t}, b^{t}), \qquad b^{t+1} \leftarrow b^{t} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{b} l^{(i)}(\mathbf{w}^{t}, b^{t})$$
This makes the iterative nature of $\mathbf{w}$ and $b$ explicit: the gradients change along with them (note that $\mathbf{w}^{t+1}$ and $b^{t+1}$ depend only on values at step $t$). Hopefully this makes it clearer for you.
Hi everybody,
Does anybody know the answer to the third question?
This is my attempt, but I don’t completely understand the part of the question about the problems of SGD with Laplacian noise:

This is actually the Laplace distribution rather than an exponential distribution, and it implies L1-norm minimization, so there is no closed-form solution.
3.1
3.2 SGD for the L1 loss (actually a subgradient method, since the absolute value is not differentiable at 0; just assume that sgn(0) = 0):
What can go wrong, besides the same things as with the L2 loss (for example, too large a learning rate)?
@gpolo
3.1 I got the same result from my derivation. Since the first term is constant, L(w, b) reduces to L1 loss minimization.
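As a sketch of where that constant first term comes from, assume the noise density is $p(\epsilon) = \frac{1}{2}\exp(-|\epsilon|)$ and that the $n$ examples are independent (both are assumptions here). The negative log-likelihood is then

$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left( \log 2 + \left| y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b \right| \right) = n \log 2 + \sum_{i=1}^{n} \left| y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b \right|$$

The term $n \log 2$ does not depend on $\mathbf{w}$ or $b$, so minimizing the expression is the same as minimizing the sum of absolute residuals.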
3.3: Subgradient + update as defined by you. Coordinate descent can also be used here. If I remember correctly, glmnet uses coordinate descent to solve L1-regularized regression (same as a Laplace prior), as it converges faster.