LightGBM results differently depending on the order of the data
$begingroup$
I have two datasets A and B which are exactly the same in terms of the number of columns, name of columns and the values. The only difference is the order of those columns. I then train the LightGBM model on each of the two datasets with the following steps
- Divide each dataset into training and testing (use the same random seed and ratio for both A and B)
- Leave the hyperparameters as pretty much default
- Set a random state as a fixed number (for reproduction)
- Tune the learning_rate using a Grid Search
- Train a LightGBM model on the training set and test it on the testing set
- Learning rate with the best performance on the testing set will be chosen
The resulting models on the two datasets are very different, which makes me think that the order of the columns does affect the training of a LightGBM model.
Do you know why this is the case?
machine-learning classification
$endgroup$
asked Apr 30 at 17:09 by Duy Bui, edited May 1 at 10:19
2 Answers
$begingroup$
A possible explanation is this:
When the order of the columns differs, the training procedure differs slightly.
LightGBM, XGBoost and CatBoost, among others, can select a random subset of the features in your dataset at each step of training.
These columns are selected randomly by position. Say your dataset has 20 columns and the root node selects the 1st, 3rd and 18th features: in the two datasets, the 1st, 3rd and 18th columns hold different features. This is repeated at every step, so the randomness affects your final result.
$endgroup$
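To make the point about positions concrete, here is a minimal sketch in plain Python. It does not call LightGBM; `sample_positions` and the feature names are hypothetical, and the only assumption is that a seeded sampler draws column positions rather than column names.

```python
import random

def sample_positions(n_columns, k, seed):
    """Return k column positions chosen with a seeded generator."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_columns), k))

# Hypothetical feature names; B holds the same columns as A, rotated by one.
columns_a = ["age", "income", "height", "weight", "score"]
columns_b = columns_a[1:] + columns_a[:1]

# Same seed, so the same positions are drawn for both datasets.
positions = sample_positions(len(columns_a), k=3, seed=42)
picked_a = [columns_a[i] for i in positions]
picked_b = [columns_b[i] for i in positions]

print("positions:", positions)
print("features seen in A:", picked_a)
print("features seen in B:", picked_b)
```

The positions are identical, but they point at different features, so a tree grown on A and a tree grown on B see different candidate columns. As far as I know, LightGBM's feature_fraction defaults to 1.0, so this only applies when column subsampling is actually turned on.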
$begingroup$
How can we control that randomness when the algorithm selects a subset of features to build a decision tree? That was also my only idea for explaining this situation. Moreover, I guess that if we always select all features per tree, the algorithm will use Gini (or something similar) to calculate the feature importance at each step, which won't create any randomness.
$endgroup$
– Duy Bui
May 1 at 10:22
$begingroup$
lightgbm allows the user to set the random seeds used for row and column sampling.
$endgroup$
– bradS
May 1 at 10:47
$begingroup$
@bradS: I didn't set the seed as a hyperparameter in LightGBM, but I checked again and the seeds should be set to a fixed number by default. That means it should give the same result, which is not the case here. lightgbm.readthedocs.io/en/latest/Parameters.html
$endgroup$
– Duy Bui
May 1 at 12:40
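For the seed discussion above, one way to rule out sampling randomness is to pin every seed explicitly and switch subsampling off. This is only a sketch: `make_params` is a hypothetical helper, the binary objective is an assumption (the question only says classification), and the parameter names are the ones I recall from the Parameters page linked above, so check them against your installed version (`deterministic` in particular is not available in older releases).

```python
def make_params(learning_rate, seed=42):
    """Build a LightGBM-style parameter dict with every random source pinned."""
    return {
        "objective": "binary",          # assumption: binary classification
        "learning_rate": learning_rate,
        "seed": seed,                   # master seed
        "feature_fraction_seed": seed,  # column sampling
        "bagging_seed": seed,           # row sampling
        "data_random_seed": seed,
        "feature_fraction": 1.0,        # use all columns for every tree
        "bagging_fraction": 1.0,        # use all rows for every tree
        "num_threads": 1,               # fixed thread count
        "deterministic": True,          # version dependent, CPU only
    }

# One parameter dict per learning rate in the grid search.
grid = [make_params(lr) for lr in (0.01, 0.05, 0.1)]
```

If the models for A and B still differ with settings like these, random sampling is probably not the cause.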
$begingroup$
While the ordering of the data is inconsequential in theory, it matters in practice. Even though you took steps to ensure reproducibility, a different ordering of the data will alter your train-test split (unless you know for certain that the train and test sets are exactly the same in both cases). Though you don't specify how you split the data, it is quite possible that a certain assortment of data points makes the model more robust to outliers and therefore yields better performance.
If the train and test data are the same in both cases, you'd likely have to check whether there is a seed or reproducibility measure (in any part of your code) that you have not taken.
$endgroup$
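A quick way to check the "exactly the same train and test sets" condition is to split by row index with a fixed seed and compare the rows by column name. The sketch below uses plain Python with made-up data; `split_row_indices` and `as_records` are hypothetical helpers, not part of any library.

```python
import random

def split_row_indices(n_rows, test_ratio, seed):
    """Shuffle row indices with a fixed seed and cut off the test part."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_ratio)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def as_records(header, rows):
    """Turn positional rows into dicts keyed by column name."""
    return [dict(zip(header, row)) for row in rows]

# Made-up data: B has the same rows as A with the columns reordered.
header_a = ["x1", "x2", "y"]
rows_a = [(1.0, 10.0, 0), (2.0, 20.0, 1), (3.0, 30.0, 0),
          (4.0, 40.0, 1), (5.0, 50.0, 0)]
header_b = ["x2", "y", "x1"]
rows_b = [tuple(rec[col] for col in header_b)
          for rec in as_records(header_a, rows_a)]

train_idx, test_idx = split_row_indices(len(rows_a), test_ratio=0.4, seed=0)
records_a = as_records(header_a, rows_a)
records_b = as_records(header_b, rows_b)

# Dict comparison ignores key order, so this compares rows by column name.
same_train = [records_a[i] for i in train_idx] == [records_b[i] for i in train_idx]
same_test = [records_a[i] for i in test_idx] == [records_b[i] for i in test_idx]
print(same_train, same_test)
```

Because the split works on row indices only, reordering the columns cannot change which rows end up in train and test.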
$begingroup$
Sorry, I forgot to mention that. Will update my query. Train and test are exactly the same because I split them using the same random seed.
$endgroup$
– Duy Bui
May 1 at 10:18
$begingroup$
@DuyBui a few suggestions to try: 1) if you are using a GPU, set gpu_use_dp to true (from github.com/Microsoft/LightGBM/pull/560#issuecomment-304561654); 2) set num_threads to a fixed number (from github.com/Microsoft/LightGBM/issues/632).
$endgroup$
– gbdata
May 2 at 5:27
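The comment does not say why these two settings help. My assumption is that both are about floating-point accumulation: sums computed in a different order, or in lower precision, can differ in the last digits, and a tiny difference can flip a close split decision. A minimal illustration in plain Python (nothing here is LightGBM code):

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, added in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)              # False
print(math.isclose(left_to_right, right_to_left))  # True
```

The two results agree to within rounding error but are not bit-identical, which is enough to break exact reproducibility.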
Answer 1: answered Apr 30 at 19:30 by Juan Esteban de la Calle
Answer 2: answered Apr 30 at 17:47 by gbdata