AIC for increasing sample size


I am using AIC as a model selection criterion in one of my projects. However, since the AIC penalty does not depend on the number of points sampled, for large n the log-likelihood term rapidly outscales the parameter penalty.

I was wondering why the parameter penalty doesn't scale with the number of points, as the log likelihood generally does. The log likelihood is now on the order of tens of thousands, so the AIC penalty for having ~10 extra parameters in the model hardly matters. But it feels like it really should. Am I misunderstanding something?
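To make the concern concrete, here is a minimal sketch of how the AIC gap between a smaller and a larger model behaves as n grows. It assumes, purely for illustration, that the larger model improves the log likelihood by a constant amount per observation; the names `aic_gap`, `extra_params`, and `gain_per_point`, the baseline of 5 parameters, and the baseline log likelihood are all hypothetical.

```python
def aic(log_lik: float, k: int) -> float:
    """AIC = 2k - 2 log L (lower is better)."""
    return 2 * k - 2 * log_lik


def aic_gap(n: int, extra_params: int, gain_per_point: float) -> float:
    """Return AIC(larger model) - AIC(smaller model).

    Assumption: the larger model gains `gain_per_point` in log likelihood
    on every observation, so the total gain grows linearly with n while
    the extra penalty stays fixed at 2 * extra_params.
    """
    base_log_lik = -1.5 * n  # arbitrary baseline; it cancels in the gap
    small = aic(base_log_lik, k=5)
    large = aic(base_log_lik + gain_per_point * n, k=5 + extra_params)
    return large - small


for n in (100, 1_000, 10_000, 100_000):
    # A positive gap favors the smaller model, a negative gap the larger one.
    print(n, aic_gap(n, extra_params=10, gain_per_point=0.002))
```

Under these assumptions the gap is 2 * extra_params - 2 * gain_per_point * n, so it changes sign at n = extra_params / gain_per_point (5000 here): below that the fixed penalty dominates, above it the accumulated likelihood gain does.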

edited Apr 5 at 12:37 by Richard Hardy

asked Apr 5 at 12:20 by Jason

Why would having 10 extra parameters matter if you have enough data to estimate them rather precisely? AIC/n (AIC per data point) estimates the log-likelihood of a new data point from the same population; when you have enough data, this is approximately equal to the average sample likelihood (log-likelihood/n), as the estimation error for the parameters is negligible.
– Richard Hardy
Apr 5 at 13:19

Sorry, I don't think I articulated my question very well. Let's say you have many points of somewhat noisy data. Adding a decent number of parameters (let's say 10) to your model will likely be very beneficial to your log likelihood. However, the 2k part of the AIC calculation will barely penalize the model for it. It just seems to me that the AIC doesn't appropriately penalize for extra params.
– Jason
Apr 5 at 13:51

In my comment above, it should be negative likelihood, not raw likelihood.
– Richard Hardy
Apr 5 at 15:22
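
A worked form of the per-observation argument in the first comment above, assuming i.i.d. observations, a fixed number of parameters $k$, and standard regularity conditions (these assumptions are added here for illustration and are not stated in the thread):

$$ \frac{\text{AIC}}{n} = \frac{2k}{n} - \frac{2}{n} \log \mathcal{L} $$

For fixed $k$ the first term shrinks to zero as $n$ grows, while the second term settles toward $-2$ times the expected log-likelihood of a single new observation. On the per-observation scale the penalty fades because the estimation error it accounts for fades too.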


model-selection aic asymptotics log-likelihood

1 Answer

It's a known criticism of AIC.

The BIC scales the penalty on the number of model parameters by the log of n:

$$ \text{BIC} = \log(n)\, k - 2 \log \mathcal{L}, $$

though you will still tend to find that BIC favors models with more parameters in larger samples. In either case, it is a desirable trait of a model selection criterion that it tends to select more parameters in larger samples. It all boils down to how many parameters you want to enter into a particular model for a particular sample size. When that's a finite number, there's no reason to use information criteria at all.

Shibata's work on AIC operates under the concept of "mean efficiency". That is, ICs work under the condition that you know or assume that the number of variables in an ideal model is infinite, and that in larger samples you will tend to favor models with more variables.
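
As a rough illustration of how the two penalties compare, the sketch below computes the log-likelihood gain that 10 extra parameters must deliver before each criterion prefers the larger model. The function names are hypothetical, and the formulas assume the usual definitions AIC = 2k - 2 log L and BIC = log(n) k - 2 log L.

```python
import math


def aic_penalty(k: int) -> float:
    """AIC penalty for k parameters; independent of n."""
    return 2.0 * k


def bic_penalty(k: int, n: int) -> float:
    """BIC penalty for k parameters; grows with log(n)."""
    return math.log(n) * k


def required_log_lik_gain(extra_params: int, n: int) -> dict:
    """Log-likelihood gain needed before each criterion prefers the larger model.

    The larger model wins when 2 * gain exceeds the extra penalty,
    i.e. when gain > penalty / 2.
    """
    return {
        "aic": aic_penalty(extra_params) / 2,
        "bic": bic_penalty(extra_params, n) / 2,
    }


for n in (100, 10_000, 1_000_000):
    print(n, required_log_lik_gain(10, n))
```

The AIC threshold stays at 10 for every n, while the BIC threshold grows only logarithmically. If the likelihood gain accumulates roughly linearly in n, both criteria will eventually admit the extra parameters, BIC just later.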

edited Apr 5 at 15:11

answered Apr 5 at 14:46 by AdamO

You can criticize a hammer if your problem does not look like a nail, but I wonder if there is any ground for criticizing the design of AIC taking into account what it actually aims for. After all, AIC is an efficient model selection criterion, which BIC and other criteria with relatively fast increasing penalties are not. So if your goal is optimal prediction (optimal in terms of maximizing the likelihood of a new observation), AIC will do it for you. If your goal is not prediction, why would you be considering AIC to begin with? Does that make sense?
– Richard Hardy
Apr 5 at 15:19

OK, I guess you can justify your criticism of the assumption of infinitely many parameters in the "ideal model", as you mention in your last paragraph. So then the question would be, does my problem look like one where this assumption may hold or not? If so, AIC is fine; if not, go look for another information criterion.
– Richard Hardy
Apr 5 at 15:26

@RichardHardy We agree on all points. The revelation that AIC only works in some very contrived situations won't stop people from asking whether it functions well in other situations. The answer, aside from "it wasn't meant to do that", is "it doesn't do that very well". It's a revelation that another inappropriate tool (BIC) "does it a bit better". There are much, much better tools for data reduction if OP wants a "sparse number of predictors in a reasonably large sample", but it wasn't the question that was asked.
– AdamO
Apr 5 at 15:41

Good. I would contest, however, your use of "very contrived situations", or even "contrived situations". A large (the largest?) part of real-world phenomena are results of infinitely complex data generating processes which require an infinite number of parameters to be fully characterized, which is exactly what the premise of AIC is. Hence, as long as the goal is optimal prediction, AIC strikes me as the most reasonable choice, or at least a solid baseline. When prediction is not the goal while, say, finding a sparse number of predictors is, we need other tools.
– Richard Hardy
Apr 5 at 16:35
