


Showing the equivalence between the regularized regression and their constraint formulas using KKT




















According to the references Book 1, Book 2 and paper, there is an equivalence between regularized regression (ridge, LASSO and elastic net) and the corresponding constrained formulations.

I have also looked at Cross Validated 1 and Cross Validated 2, but I cannot find a clear answer that shows this equivalence or the logic behind it.

My question is:

How can this equivalence be shown using the Karush–Kuhn–Tucker (KKT) conditions?

The following formulas are for ridge regression.

(Image: Ridge)

The following formulas are for LASSO regression.

(Image: LASSO)

The following formulas are for elastic net regression.

(Image: Elastic Net)

NOTE

This question is not homework. It is only meant to improve my understanding of this topic.





This question has an open bounty worth +50 reputation from jeza, ending at 2019-04-13 16:45:13Z (in 5 days).

This question has not received enough attention.

Required: a detailed, step-by-step answer with a practical example.





























      regression optimization lasso ridge-regression elastic-net














      asked Apr 4 at 16:05 by jeza; edited yesterday by jeza




























          1 Answer












          $begingroup$

          The more technical answer is that the constrained optimization problem can be written in terms of Lagrange multipliers. In particular, up to an additive constant that does not depend on $\beta$, the Lagrangian associated with the constrained optimization problem is given by
          $$\mathcal L(\beta) = \sum_{i=1}^N \left(y_i - \sum_{j=1}^p x_{ij} \beta_j\right)^2 + \mu \left\{(1-\alpha) \sum_{j=1}^p |\beta_j| + \alpha \sum_{j=1}^p \beta_j^2\right\}$$
          where $\mu$ is a multiplier chosen to satisfy the constraints of the problem. The first order conditions (which are sufficient since you are working with nice proper convex functions) for this optimization problem can thus be obtained by differentiating the Lagrangian with respect to $\beta$ and setting the derivatives equal to 0. (It is a bit more nuanced because the LASSO part has non-differentiable points, but convex analysis generalizes the derivative to the subgradient so that the first order condition still works.) These first order conditions are identical to the first order conditions of the unconstrained (penalized) problem you wrote down, with $\mu$ playing the role of the penalty parameter.
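The subgradient version of the first order condition can be checked numerically. The following is a minimal sketch for the pure LASSO case, assuming the objective $\sum_i (y_i - x_i^T\beta)^2 + \mu \sum_j |\beta_j|$, synthetic data, and a plain coordinate-descent solver; the helper names `lasso_cd` and `lasso_kkt_violation` are hypothetical and not taken from any library. The condition being tested is that $2 x_j^T r = \mu\,\mathrm{sign}(\beta_j)$ for nonzero coefficients and $|2 x_j^T r| \le \mu$ for zero coefficients, where $r = y - X\beta$.

```python
import numpy as np

def soft_threshold(a, k):
    """Soft-thresholding operator: shrink a towards zero by k."""
    return np.sign(a) * np.maximum(np.abs(a) - k, 0.0)

def lasso_cd(X, y, mu, n_sweeps=500):
    """Minimise sum((y - X b)^2) + mu * sum(|b|) by cyclic coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of coordinate j removed.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            # Exact minimiser of the one-dimensional subproblem.
            beta[j] = soft_threshold(rho, mu / 2.0) / col_sq[j]
    return beta

def lasso_kkt_violation(X, y, beta, mu):
    """Largest violation of the subgradient stationarity condition."""
    grad = -2.0 * X.T @ (y - X @ beta)  # gradient of the smooth part
    active = beta != 0
    viol_active = np.abs(grad[active] + mu * np.sign(beta[active]))
    viol_zero = np.maximum(np.abs(grad[~active]) - mu, 0.0)
    return max(viol_active.max(initial=0.0), viol_zero.max(initial=0.0))

# Synthetic data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

mu = 20.0
beta_hat = lasso_cd(X, y, mu)
violation = lasso_kkt_violation(X, y, beta_hat, mu)
t_implied = np.abs(beta_hat).sum()  # constraint level matched by this mu
print("beta_hat:", beta_hat)
print("max KKT violation:", violation)
print("implied L1 budget t:", t_implied)
```

The value `t_implied` is the $\ell_1$ budget for which the constrained LASSO problem has the same solution as the penalized problem with this $\mu$.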



          However, I think it is useful to see why, in general, such optimization problems can be viewed either as constrained problems or as unconstrained problems. More concretely, suppose we have an unconstrained optimization problem of the following form:
          $$\max_x f(x) + \lambda g(x)$$
          We can always try to solve this optimization directly, but sometimes it makes sense to break the problem into subcomponents. In particular, it is not hard to see that
          $$\max_x f(x) + \lambda g(x) = \max_t \left\{\left(\max_x f(x) \ \mathrm{s.t.}\ g(x) = t\right) + \lambda t\right\}$$
          So for a fixed value of $\lambda$ (and assuming the functions to be optimized actually achieve their optima), we can associate with it a value $t^*$ that solves the outer optimization problem. This gives us a sort of mapping from unconstrained optimization problems to constrained problems. In your particular setting, since everything is nicely behaved for elastic net regression, this mapping should in fact be one to one, so you can switch between the two formulations depending on which is more useful for a particular application. In general, the relationship between constrained and unconstrained problems may be less well behaved, but it is still useful to think about the extent to which you can move between them.
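A minimal numerical sketch of this correspondence, specialized to ridge regression. Two assumptions differ from the general statement above: the problem is written as a minimization, and the constraint is the inequality $||\beta||^2 \le t$ rather than an equality. The data are synthetic, and `ridge_penalized` and `ridge_constrained` are hypothetical helper names. The penalized problem is solved in closed form for a fixed $\lambda$, the implied level $t^* = ||\hat\beta_\lambda||^2$ is read off, and the constrained problem with that $t^*$ is then solved independently by projected gradient descent.

```python
import numpy as np

def ridge_penalized(X, y, lam):
    """Closed-form minimiser of sum((y - X b)^2) + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_constrained(X, y, t, n_iter=5000):
    """Minimise sum((y - X b)^2) subject to ||b||^2 <= t by projected gradient descent."""
    p = X.shape[1]
    # The gradient of the residual sum of squares is Lipschitz with
    # constant 2 * sigma_max(X)^2, so 1 / that constant is a safe step size.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    radius = np.sqrt(t)
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = beta + step * 2.0 * X.T @ (y - X @ beta)  # gradient step
        norm = np.linalg.norm(beta)
        if norm > radius:                                # project onto the ball
            beta = beta * (radius / norm)
    return beta

# Synthetic data (an assumption made purely for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + 0.2 * rng.normal(size=60)

lam = 10.0
beta_pen = ridge_penalized(X, y, lam)
t_star = beta_pen @ beta_pen            # constraint level implied by lam
beta_con = ridge_constrained(X, y, t_star)
gap = np.max(np.abs(beta_pen - beta_con))
print("penalized  :", beta_pen)
print("constrained:", beta_con)
print("max abs difference:", gap)
```

The two solvers share no code beyond the data, so agreement of `beta_pen` and `beta_con` illustrates, for this one example, that the penalized problem with $\lambda$ and the constrained problem with $t^*$ have the same solution.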



          Edit: As requested, I will include a more concrete analysis for ridge regression, since it captures the main ideas while avoiding the technicalities associated with the non-differentiability of the LASSO penalty. Recall that we are solving the optimization problem (in matrix notation):

          $$\underset{\beta}{\mathrm{argmin}} \left\{\sum_{i=1}^N \left(y_i - x_i^T \beta\right)^2\right\}\quad\mathrm{s.t.}\quad ||\beta||^2 \leq M$$

          Let $\beta^{OLS}$ be the OLS solution (i.e. when there is no constraint). I will focus on the case where $M < \left|\left|\beta^{OLS}\right|\right|^2$ (provided this exists), since otherwise the constraint is uninteresting: it does not bind. The Lagrangian for this problem can be written
          $$\mathcal L(\beta, \mu) = \sum_{i=1}^N \left(y_i - x_i^T \beta\right)^2 + \mu\left(||\beta||^2 - M\right), \quad \mu \geq 0$$
          Differentiating with respect to $\beta$, we get the first order conditions:
          $$0 = -2 \left(\sum_{i=1}^N y_i x_i - \left(\sum_{i=1}^N x_i x_i^T + \mu I\right) \beta\right)$$
          which is just a system of linear equations and hence can be solved:
          $$\hat\beta = \left(\sum_{i=1}^N x_i x_i^T + \mu I\right)^{-1}\left(\sum_{i=1}^N y_i x_i\right)$$
          for some choice of multiplier $\mu$. The multiplier is then chosen to make the constraint hold with equality, i.e. we need

          $$\left(\left(\sum_{i=1}^N x_i x_i^T + \mu I\right)^{-1}\left(\sum_{i=1}^N y_i x_i\right)\right)^T\left(\left(\sum_{i=1}^N x_i x_i^T + \mu I\right)^{-1}\left(\sum_{i=1}^N y_i x_i\right)\right) = M$$
          A solution $\mu$ exists since the LHS is continuous and monotonically decreasing in $\mu$. This equation gives an explicit mapping from multipliers $\mu \in (0,\infty)$ to constraints $M \in \left(0, \left|\left|\beta^{OLS}\right|\right|^2\right)$ with
          $$\lim_{\mu\to 0} M(\mu) = \left|\left|\beta^{OLS}\right|\right|^2$$
          when the RHS exists and
          $$\lim_{\mu \to \infty} M(\mu) = 0$$
          This mapping actually corresponds to something quite intuitive. The envelope theorem tells us that $\mu(M)$ corresponds to the marginal decrease in error we get from a small relaxation of the constraint $M$. This explains why $\mu \to 0$ corresponds to $M \to \left|\left|\beta^{OLS}\right|\right|^2$. Once the constraint is not binding, there is no value in relaxing it any more, which is why the multiplier vanishes.
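The mapping between $\mu$ and $M$ can be traced numerically. The sketch below makes several assumptions: synthetic data, a constraint written on the squared norm ($||\beta||^2 \le M$), a simple bisection to invert $M(\mu)$, and hypothetical helper names (`ridge_beta`, `norm_sq`, `multiplier_for`, `constrained_rss`). It also compares a finite-difference slope of the optimal residual sum of squares in $M$ with $-\mu$, which is what the envelope theorem argument suggests.

```python
import numpy as np

def ridge_beta(X, y, mu):
    """Solve (X^T X + mu I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + mu * np.eye(p), X.T @ y)

def norm_sq(X, y, mu):
    """M(mu) = ||beta_hat(mu)||^2."""
    b = ridge_beta(X, y, mu)
    return b @ b

def multiplier_for(X, y, M, tol=1e-12):
    """Find mu > 0 with ||beta_hat(mu)||^2 = M by bisection.

    Assumes 0 < M < ||beta_OLS||^2, so that the constraint binds.
    """
    lo, hi = 0.0, 1.0
    while norm_sq(X, y, hi) > M:      # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if norm_sq(X, y, mid) > M:    # norm still too large: need more shrinkage
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def constrained_rss(X, y, M):
    """Optimal residual sum of squares under ||beta||^2 <= M (binding case)."""
    b = ridge_beta(X, y, multiplier_for(X, y, M))
    r = y - X @ b
    return r @ r

# Synthetic data (an assumption made purely for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + 0.2 * rng.normal(size=60)

beta_ols = ridge_beta(X, y, 0.0)
M = 0.5 * (beta_ols @ beta_ols)       # a binding constraint level
mu_star = multiplier_for(X, y, M)

# Central finite difference of the optimal RSS with respect to M.
h = 1e-4 * M
slope = (constrained_rss(X, y, M + h) - constrained_rss(X, y, M - h)) / (2 * h)
print("mu for this M:", mu_star)
print("d RSS* / dM  :", slope)
```

In this sketch `slope` should be close to `-mu_star`: relaxing the constraint by a small amount lowers the attainable error at a rate given by the multiplier.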

















          $endgroup$












          • Could you please provide a detailed, step-by-step answer with a practical example, if that is possible?
            – jeza
            21 hours ago










          • Many thanks. Why do you not mention KKT? I am not familiar with this area, so treat me as a high school student.
            – jeza
            6 hours ago










          • The KKT conditions in this case are a generalization of the “first order conditions” I mention, obtained by differentiating the Lagrangian and setting the derivative equal to 0. Since in this example the constraints hold with equality, we don’t need the KKT conditions in full generality. In more complicated cases, all that happens is that some of the equalities above become inequalities and the multiplier becomes 0 for constraints that become non-binding. For example, this is exactly what happens when $M > ||\beta^{OLS}||^2$ in the above.
            – stats_model
            3 hours ago
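For reference, the KKT conditions mentioned in this comment can be written out for the ridge problem. This is a sketch under one notational assumption: the Lagrangian is taken as $\sum_i (y_i - x_i^T\beta)^2 + \mu\left(||\beta||^2 - M\right)$ with the constraint $||\beta||^2 \le M$.

$$
\begin{aligned}
&\text{Stationarity:} && -2\sum_{i=1}^N x_i\left(y_i - x_i^T\beta\right) + 2\mu\beta = 0 \\
&\text{Primal feasibility:} && ||\beta||^2 \le M \\
&\text{Dual feasibility:} && \mu \ge 0 \\
&\text{Complementary slackness:} && \mu\left(||\beta||^2 - M\right) = 0
\end{aligned}
$$

If $M \ge ||\beta^{OLS}||^2$, these hold with $\mu = 0$ and $\beta = \beta^{OLS}$. If $M < ||\beta^{OLS}||^2$, then $\mu = 0$ is impossible (stationarity would give $\beta^{OLS}$, which is infeasible), so complementary slackness forces $||\beta||^2 = M$ with $\mu > 0$. In both cases the stationarity line is exactly the first order condition of the penalized problem $\sum_i (y_i - x_i^T\beta)^2 + \mu ||\beta||^2$, which is the sense in which the two formulations are equivalent.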





































          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          7












          $begingroup$

          The more technical answer is because the constrained optimization problem can be written in terms of Lagrange multipliers. In particular, the Lagrangian associated with the constrained optimization problem is given by
          $$mathcal L(beta) = undersetbetamathrmargmin,leftsum_i=1^N left(y_i - sum_j=1^p x_ij beta_jright)^2right + mu leftbeta_j$$
          where $mu$ is a multiplier chosen to satisfy the constraints of the problem. The first order conditions (which are sufficient since you are working with nice proper convex functions) for this optimization problem can thus be obtained by differentiating the Lagrangian with respect to $beta$ and setting the derivatives equal to 0 (it's a bit more nuanced since the LASSO part has undifferentiable points, but there are methods from convex analysis to generalize the derivative to make the first order condition still work). It is clear that these first order conditions are identical to the first order conditions of the unconstrained problem you wrote down.



          However, I think it's useful to see why in general, with these optimization problems, it is often possible to think about the problem either through the lens of a constrained optimization problem or through the lens of an unconstrained problem. More concretely, suppose we have an unconstrained optimization problem of the following form:
          $$max_x f(x) + lambda g(x)$$
          We can always try to solve this optimization directly, but sometimes, it might make sense to break this problem into subcomponents. In particular, it is not hard to see that
          $$max_x f(x) + lambda g(x) = max_t left(max_x f(x) mathrm s.t g(x) = tright) + lambda t$$
          So for a fixed value of $lambda$ (and assuming the functions to be optimized actually achieve their optima), we can associate with it a value $t^*$ that solves the outer optimization problem. This gives us a sort of mapping from unconstrained optimization problems to constrained problems. In your particular setting, since everything is nicely behaved for elastic net regression, this mapping should in fact be one to one, so it will be useful to be able to switch between these two contexts depending on which is more useful to a particular application. In general, this relationship between constrained and unconstrained problems may be less well behaved, but it may still be useful to think about to what extent you can move between the constrained and unconstrained problem.



          Edit: As requested, I will include a more concrete analysis for ridge regression, since it captures the main ideas while avoiding having to deal with the technicalities associated with the non-differentiability of the LASSO penalty. Recall, we are solving optimization problem (in matrix notation):



          $$undersetbetamathrmargmin leftsum_i=1^N y_i - x_i^T betarightquadmathrms.t., ||beta||^2 leq M$$



          Let $beta^OLS$ be the OLS solution (i.e. when there is no constraint). Then I will focus on the case where $M < left|left|beta^OLSright|right|$ (provided this exists) since otherwise, the constraint is uninteresting since it does not bind. The Lagrangian for this problem can be written
          $$mathcal L(beta) = undersetbetamathrmargmin leftsum_i=1^N y_i - x_i^T betaright - mucdot||beta||^2 leq M$$
          Then differentiating, we get first order conditions:
          $$0 = -2 left(sum_i=1^N y_i x_i + left(sum_i=1^N x_i x_i^T + mu Iright) betaright)$$
          which is just a system of linear equations and hence can be solved:
          $$hatbeta = left(sum_i=1^N x_i x_i^T + mu Iright)^-1left(sum_i=1^N y_i x_iright)$$
          for some choice of multiplier $mu$. The multiplier is then simply chosen to make the constraint true, i.e. we need



          $$left(left(sum_i=1^N x_i x_i^T + mu Iright)^-1left(sum_i=1^N y_i x_iright)right)^Tleft(left(sum_i=1^N x_i x_i^T + mu Iright)^-1left(sum_i=1^N y_i x_iright)right) = M$$
          which exists since the LHS is monotonic in $mu$. This equation gives an explicit mapping from multipliers $mu in (0,infty)$ to constraints, $M in left(0, left|left|beta^OLSright|right|right)$ with
          $$lim_muto 0 M(mu) = left|left|beta^OLSright|right|$$
          when the RHS exists and
          $$lim_mu to infty M(mu) = 0$$
          This mapping actually corresponds to something quite intuitive. The envelope theorem tells us that $mu(M)$ corresponds to the marginal decrease in error we get from a small relaxation of the constraint $M$. This explains why when $mu to 0$ corresponds to $M to left|right|beta^OLSleft|right|$. Once the constraint is not binding, there is no value in relaxing it any more, which is why the multiplier vanishes.






          share|cite|improve this answer











          $endgroup$












          • $begingroup$
            could you please provide us with a detailed answer step by step with a practical example if that possible.
            $endgroup$
            – jeza
            21 hours ago










          • $begingroup$
            many thanks, why you do not mention KKT? I am not familiar with this area, so treat me as a high school student.
            $endgroup$
            – jeza
            6 hours ago










          • $begingroup$
            The KKT conditions in this case are a generalization of the “first order conditions” I mention by differentiating the Lagrangian and setting the derivative equal to 0. Since in this example, the constraints hold with equality, we don’t need the KKT conditions in full generally. In more complicated cases, all that happens is that some of the equalities above become inequalities and the multiplier becomes 0 for constraints become non binding . For example, this is exactly what happens when $M > ||beta^OLS||$ in the above.
            $endgroup$
            – stats_model
            3 hours ago
















          7












          $begingroup$

          The more technical answer is because the constrained optimization problem can be written in terms of Lagrange multipliers. In particular, the Lagrangian associated with the constrained optimization problem is given by
          $$mathcal L(beta) = undersetbetamathrmargmin,leftsum_i=1^N left(y_i - sum_j=1^p x_ij beta_jright)^2right + mu leftbeta_j$$
          where $mu$ is a multiplier chosen to satisfy the constraints of the problem. The first order conditions (which are sufficient since you are working with nice proper convex functions) for this optimization problem can thus be obtained by differentiating the Lagrangian with respect to $beta$ and setting the derivatives equal to 0 (it's a bit more nuanced since the LASSO part has undifferentiable points, but there are methods from convex analysis to generalize the derivative to make the first order condition still work). It is clear that these first order conditions are identical to the first order conditions of the unconstrained problem you wrote down.



          However, I think it's useful to see why in general, with these optimization problems, it is often possible to think about the problem either through the lens of a constrained optimization problem or through the lens of an unconstrained problem. More concretely, suppose we have an unconstrained optimization problem of the following form:
          $$max_x f(x) + lambda g(x)$$
          We can always try to solve this optimization directly, but sometimes, it might make sense to break this problem into subcomponents. In particular, it is not hard to see that
          $$max_x f(x) + lambda g(x) = max_t left(max_x f(x) mathrm s.t g(x) = tright) + lambda t$$
          So for a fixed value of $lambda$ (and assuming the functions to be optimized actually achieve their optima), we can associate with it a value $t^*$ that solves the outer optimization problem. This gives us a sort of mapping from unconstrained optimization problems to constrained problems. In your particular setting, since everything is nicely behaved for elastic net regression, this mapping should in fact be one to one, so it will be useful to be able to switch between these two contexts depending on which is more useful to a particular application. In general, this relationship between constrained and unconstrained problems may be less well behaved, but it may still be useful to think about to what extent you can move between the constrained and unconstrained problem.



          Edit: As requested, I will include a more concrete analysis for ridge regression, since it captures the main ideas while avoiding having to deal with the technicalities associated with the non-differentiability of the LASSO penalty. Recall, we are solving optimization problem (in matrix notation):



          $$undersetbetamathrmargmin leftsum_i=1^N y_i - x_i^T betarightquadmathrms.t., ||beta||^2 leq M$$



          Let $beta^OLS$ be the OLS solution (i.e. when there is no constraint). Then I will focus on the case where $M < left|left|beta^OLSright|right|$ (provided this exists) since otherwise, the constraint is uninteresting since it does not bind. The Lagrangian for this problem can be written
          $$mathcal L(beta) = undersetbetamathrmargmin leftsum_i=1^N y_i - x_i^T betaright - mucdot||beta||^2 leq M$$
          Then differentiating, we get first order conditions:
          $$0 = -2 left(sum_i=1^N y_i x_i + left(sum_i=1^N x_i x_i^T + mu Iright) betaright)$$
          which is just a system of linear equations and hence can be solved:
          $$hatbeta = left(sum_i=1^N x_i x_i^T + mu Iright)^-1left(sum_i=1^N y_i x_iright)$$
          for some choice of multiplier $mu$. The multiplier is then simply chosen to make the constraint true, i.e. we need



          $$left(left(sum_i=1^N x_i x_i^T + mu Iright)^-1left(sum_i=1^N y_i x_iright)right)^Tleft(left(sum_i=1^N x_i x_i^T + mu Iright)^-1left(sum_i=1^N y_i x_iright)right) = M$$
          which exists since the LHS is monotonic in $mu$. This equation gives an explicit mapping from multipliers $mu in (0,infty)$ to constraints, $M in left(0, left|left|beta^OLSright|right|right)$ with
          $$lim_muto 0 M(mu) = left|left|beta^OLSright|right|$$
          when the RHS exists and
          $$lim_mu to infty M(mu) = 0$$
          This mapping actually corresponds to something quite intuitive. The envelope theorem tells us that $mu(M)$ corresponds to the marginal decrease in error we get from a small relaxation of the constraint $M$. This explains why when $mu to 0$ corresponds to $M to left|right|beta^OLSleft|right|$. Once the constraint is not binding, there is no value in relaxing it any more, which is why the multiplier vanishes.






          share|cite|improve this answer











          $endgroup$












          • $begingroup$
            could you please provide us with a detailed answer step by step with a practical example if that possible.
            $endgroup$
            – jeza
            21 hours ago










          • $begingroup$
            many thanks, why you do not mention KKT? I am not familiar with this area, so treat me as a high school student.
            $endgroup$
            – jeza
            6 hours ago










          • $begingroup$
            The KKT conditions in this case are a generalization of the “first order conditions” I mention by differentiating the Lagrangian and setting the derivative equal to 0. Since in this example, the constraints hold with equality, we don’t need the KKT conditions in full generally. In more complicated cases, all that happens is that some of the equalities above become inequalities and the multiplier becomes 0 for constraints become non binding . For example, this is exactly what happens when $M > ||beta^OLS||$ in the above.
            $endgroup$
            – stats_model
            3 hours ago














          7












          7








          7





          $begingroup$

          The more technical answer is because the constrained optimization problem can be written in terms of Lagrange multipliers. In particular, the Lagrangian associated with the constrained optimization problem is given by
          $$mathcal L(beta) = undersetbetamathrmargmin,leftsum_i=1^N left(y_i - sum_j=1^p x_ij beta_jright)^2right + mu leftbeta_j$$
          where $mu$ is a multiplier chosen to satisfy the constraints of the problem. The first order conditions (which are sufficient since you are working with nice proper convex functions) for this optimization problem can thus be obtained by differentiating the Lagrangian with respect to $beta$ and setting the derivatives equal to 0 (it's a bit more nuanced since the LASSO part has undifferentiable points, but there are methods from convex analysis to generalize the derivative to make the first order condition still work). It is clear that these first order conditions are identical to the first order conditions of the unconstrained problem you wrote down.



          However, I think it's useful to see why in general, with these optimization problems, it is often possible to think about the problem either through the lens of a constrained optimization problem or through the lens of an unconstrained problem. More concretely, suppose we have an unconstrained optimization problem of the following form:
          $$max_x f(x) + lambda g(x)$$
          We can always try to solve this optimization directly, but sometimes, it might make sense to break this problem into subcomponents. In particular, it is not hard to see that
          $$max_x f(x) + lambda g(x) = max_t left(max_x f(x) mathrm s.t g(x) = tright) + lambda t$$
          So for a fixed value of $lambda$ (and assuming the functions to be optimized actually achieve their optima), we can associate with it a value $t^*$ that solves the outer optimization problem. This gives us a sort of mapping from unconstrained optimization problems to constrained problems. In your particular setting, since everything is nicely behaved for elastic net regression, this mapping should in fact be one to one, so it will be useful to be able to switch between these two contexts depending on which is more useful to a particular application. In general, this relationship between constrained and unconstrained problems may be less well behaved, but it may still be useful to think about to what extent you can move between the constrained and unconstrained problem.



          Edit: As requested, I will include a more concrete analysis for ridge regression, since it captures the main ideas while avoiding having to deal with the technicalities associated with the non-differentiability of the LASSO penalty. Recall, we are solving optimization problem (in matrix notation):



          $$undersetbetamathrmargmin leftsum_i=1^N y_i - x_i^T betarightquadmathrms.t., ||beta||^2 leq M$$



          Let $\beta^{OLS}$ be the OLS solution (i.e. when there is no constraint). I will focus on the case where $M < \left|\left|\beta^{OLS}\right|\right|^2$ (provided the OLS solution exists), since otherwise the constraint is uninteresting: it does not bind. The Lagrangian for this problem can be written
          $$\mathcal L(\beta, \mu) = \sum_{i=1}^N \left(y_i - x_i^T \beta\right)^2 + \mu\left(||\beta||^2 - M\right)$$
          Then differentiating with respect to $\beta$, we get the first order conditions:
          $$0 = -2 \left(\sum_{i=1}^N y_i x_i - \left(\sum_{i=1}^N x_i x_i^T + \mu I\right) \beta\right)$$
          which is just a system of linear equations and hence can be solved:
          $$\hat\beta = \left(\sum_{i=1}^N x_i x_i^T + \mu I\right)^{-1}\left(\sum_{i=1}^N y_i x_i\right)$$
          for some choice of multiplier $\mu$. The multiplier is then simply chosen to make the constraint hold with equality, i.e. we need

          $$\left(\left(\sum_{i=1}^N x_i x_i^T + \mu I\right)^{-1}\left(\sum_{i=1}^N y_i x_i\right)\right)^T\left(\left(\sum_{i=1}^N x_i x_i^T + \mu I\right)^{-1}\left(\sum_{i=1}^N y_i x_i\right)\right) = M$$
          A solution $\mu$ exists since the LHS is continuous and monotonically decreasing in $\mu$. This equation gives an explicit mapping from multipliers $\mu \in (0,\infty)$ to constraints $M \in \left(0, \left|\left|\beta^{OLS}\right|\right|^2\right)$ with
          $$\lim_{\mu\to 0} M(\mu) = \left|\left|\beta^{OLS}\right|\right|^2$$
          when the RHS exists and
          $$\lim_{\mu \to \infty} M(\mu) = 0$$
          This mapping actually corresponds to something quite intuitive. The envelope theorem tells us that $\mu(M)$ corresponds to the marginal decrease in error we get from a small relaxation of the constraint $M$. This explains why $\mu \to 0$ corresponds to $M \to \left|\left|\beta^{OLS}\right|\right|^2$: once the constraint is not binding, there is no value in relaxing it any more, which is why the multiplier vanishes.
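To make the mapping between $\mu$ and $M$ tangible, here is a minimal sketch in plain Python for a toy problem with two features. The data, the target level `M = 3.0`, and all helper names (`gram_and_moment`, `ridge_beta`, `foc_residual`, `M_of_mu`, `mu_of_M`) are illustrative assumptions. It computes $\hat\beta$ from the closed form for a given multiplier, checks the first order condition, and then recovers the multiplier matching a given constraint level by bisection, which relies on $M(\mu)$ being decreasing.

```python
# Toy ridge example with p = 2 features. The data, the target M = 3.0 and all
# helper names are made up for illustration.

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]  # rows are x_i^T
y = [5.0, 4.0, 11.0, 10.0]  # y_i = x_i1 + 2 * x_i2, so beta_OLS = (1, 2)

def gram_and_moment(X, y):
    """Return A = sum_i x_i x_i^T (2x2) and b = sum_i y_i x_i (length 2)."""
    A = [[sum(x[j] * x[k] for x in X) for k in range(2)] for j in range(2)]
    b = [sum(yi * x[j] for x, yi in zip(X, y)) for j in range(2)]
    return A, b

def ridge_beta(A, b, mu):
    """Solve (A + mu*I) beta = b for a 2x2 system via Cramer's rule."""
    a11, a12 = A[0][0] + mu, A[0][1]
    a21, a22 = A[1][0], A[1][1] + mu
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

def foc_residual(A, b, mu, beta):
    """First order condition divided by 2: (A + mu*I) beta - b, zero at the optimum."""
    return [sum(A[j][k] * beta[k] for k in range(2)) + mu * beta[j] - b[j]
            for j in range(2)]

def M_of_mu(A, b, mu):
    """Constraint level implied by mu: the squared norm of beta_hat(mu)."""
    return sum(c * c for c in ridge_beta(A, b, mu))

def mu_of_M(A, b, M, tol=1e-12):
    """Bisection for mu > 0 with M_of_mu(mu) = M, assuming 0 < M < M_of_mu(0)."""
    lo, hi = 0.0, 1.0
    while M_of_mu(A, b, hi) > M:      # grow the bracket until the norm is small enough
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if M_of_mu(A, b, mid) > M:    # norm still too large: more shrinkage needed
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

A, b = gram_and_moment(X, y)
beta_hat = ridge_beta(A, b, 1.0)
print(beta_hat, foc_residual(A, b, 1.0, beta_hat))

M_ols = M_of_mu(A, b, 0.0)            # squared norm of the OLS solution
mu_star = mu_of_M(A, b, 3.0)          # multiplier matching the constraint M = 3
print(M_ols, mu_star, M_of_mu(A, b, mu_star))
```

For any `M` at or above `M_ols` the constraint does not bind and the matching multiplier is 0, so `mu_of_M` is only meant for the binding case.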

















          $endgroup$





















          edited 17 hours ago

























          answered Apr 4 at 16:34









          stats_model











          • Could you please provide a detailed, step-by-step answer with a practical example, if that is possible?
            – jeza
            21 hours ago

          • Many thanks. Why do you not mention KKT? I am not familiar with this area, so treat me as a high school student.
            – jeza
            6 hours ago

          • The KKT conditions in this case are a generalization of the “first order conditions” I mention, obtained by differentiating the Lagrangian and setting the derivative equal to 0. Since in this example the constraints hold with equality, we don’t need the KKT conditions in full generality. In more complicated cases, all that happens is that some of the equalities above become inequalities and the multiplier becomes 0 for constraints that become non-binding. For example, this is exactly what happens when $M > ||\beta^{OLS}||^2$ in the above.
            – stats_model
            3 hours ago


































