How to give a higher importance to certain features in a (k-means) clustering model?















I am clustering data with numeric and categorical variables. To process the categorical variables for the cluster model, I create dummy variables. However, I feel like this results in a higher importance for these dummy variables because multiple dummy variables represent one categorical variable.

For example, I have a categorical variable Airport that will result in multiple dummy variables: LAX, JFK, MIA and BOS. Now suppose I also have a numeric Temperature variable. I also scale all variables to be between 0 and 1. Now my Airport variable seems to be 4 times more important than the Temperature variable, and the clusters will be mostly based on the Airport variable.

My problem is that I want all variables to have the same importance. Is there a way to do this? I was thinking of scaling the variables in a different way but I don't know how to scale them in order to give them the same importance.







Tags: machine-learning, clustering, feature-scaling, dummy-variables















asked Apr 16 at 8:33 by Eva
























3 Answers





















You cannot really use k-means clustering if your data contains categorical variables, since k-means uses Euclidean distance, which does not make a lot of sense with dummy variables. Check out the answers to this similar question.

I would suggest you switch to k-modes for your clustering algorithm. You will find good implementations for both Python and R.
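
To make the k-modes suggestion a bit more concrete, here is a minimal sketch of its two building blocks, assuming purely categorical records: the simple matching dissimilarity (the number of attributes on which two records differ) and the per-attribute mode that takes the place of the mean as the cluster center. The function names, the airline attribute and the toy records are made up for illustration; this is not the API of any particular k-modes package.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Return the most frequent category of each attribute (the cluster 'center')."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical records: (airport, airline)
cluster = [("LAX", "Delta"), ("LAX", "United"), ("JFK", "Delta")]
center = cluster_mode(cluster)
distance = matching_dissimilarity(("MIA", "Delta"), center)
print(center, distance)
```

With this dissimilarity a categorical variable contributes at most 1 per record pair, no matter how many levels it has, which is the property the question is after for the Airport variable.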















answered Apr 16 at 9:15 by georg_un (edited Apr 16 at 14:53)
























The k-means objective function is a sum of squared differences over the features.

So if you want to increase the importance of a feature, scale it accordingly. If you multiply a feature by 2, its squared differences grow by a factor of 4, so you have increased its weight in the objective.

However, I would just not use k-means for one-hot variables. The mean is meant for continuous variables, and minimizing the sum of squares on a one-hot variable has weird semantics.
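
As a worked illustration of this scaling argument, the sketch below uses the Airport and Temperature example from the question, with made-up rows. Two rows with different airports differ in two one-hot columns, so they contribute 2 to the squared Euclidean distance, while a Temperature scaled to [0, 1] contributes at most 1. Multiplying the one-hot columns by 1/sqrt(2) brings a category mismatch down to 1. The helper names and the choice of 1/sqrt(2) are assumptions for this example, not something k-means prescribes.

```python
import math

AIRPORTS = ["LAX", "JFK", "MIA", "BOS"]

def encode(airport, temperature, onehot_scale=1.0):
    """One-hot encode the airport (times onehot_scale) and append the temperature."""
    onehot = [onehot_scale if airport == name else 0.0 for name in AIRPORTS]
    return onehot + [temperature]

def squared_distance(u, v):
    """Squared Euclidean distance, the quantity k-means sums over features."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Two rows with different airports and the same (scaled) temperature.
plain = squared_distance(encode("LAX", 0.5), encode("JFK", 0.5))

# Same rows, with the one-hot columns multiplied by 1/sqrt(2).
s = 1 / math.sqrt(2)
rescaled = squared_distance(encode("LAX", 0.5, s), encode("JFK", 0.5, s))

# Largest possible contribution of a temperature scaled to [0, 1].
temperature_max = squared_distance(encode("LAX", 0.0), encode("LAX", 1.0))

print(plain, rescaled, temperature_max)
```

More generally, multiplying a feature by sqrt(w) multiplies its contribution to the squared distance by w, so any relative weight can be set this way.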

















answered Apr 16 at 13:34 by Anony-Mousse
























You cannot use the k-means clustering algorithm if your data contains categorical variables; k-modes is suitable for clustering categorical data. However, there are several algorithms for clustering mixed data, which are variations or modifications of the basic ones.
Please check the following paper:

"Survey of State-of-the-Art Mixed Data Clustering Algorithms", Amir Ahmad and Shehroz Khan, 2019.
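
One well-known example of such a modification is k-prototypes, which combines the k-means and k-modes dissimilarities: squared Euclidean distance on the numeric attributes plus a weight gamma times the number of mismatching categorical attributes. The sketch below only illustrates that combined measure. The function name, the toy rows and the value of gamma are assumptions for illustration, and the survey should be consulted for the actual algorithms.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma times categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric + gamma * categorical

# Hypothetical rows: numeric part = (temperature scaled to [0, 1],), categorical part = (airport,)
same_airport = mixed_dissimilarity((0.25,), ("LAX",), (0.75,), ("LAX",))
other_airport = mixed_dissimilarity((0.25,), ("LAX",), (0.75,), ("JFK",))
print(same_airport, other_airport)
```

Here gamma controls how much a categorical mismatch counts relative to the numeric attributes, so it plays the role of the feature weight asked about in the question.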

















answered Apr 16 at 22:18 by Christos Karatsalos



















