How to handle columns with categorical data and many unique values

I have a categorical column with 3,349 unique values, representing cities of the world, in a dataset of 18,000k (18 million) rows.

I also have another column, representing product category, with 145 unique values that I could use in my model.

Can I apply one-hot encoding to these columns, or is there a problem with that approach? Is there a maximum number of unique values beyond which one-hot encoding starts to cause problems?

If I should use a different encoding instead, can you point me in the right direction?







Tags: machine-learning data categorical-data encoding

asked Apr 8 at 11:04 by dungeon




















1 Answer






For categorical columns, you have two options:

1. Entity embeddings
2. One-hot vectors

For the column with 145 values, I would use one-hot encoding; for the column with ~3k values, I would use an embedding. This decision might change depending on the overall number of features.
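To make the trade-off concrete, here is a back-of-the-envelope sketch of how large a one-hot matrix would be for the two columns in the question. The row count and cardinalities come from the question; storing one byte per cell in a dense matrix and the helper names `dense_one_hot_bytes` and `nonzero_entries` are assumptions for illustration only.

```python
ROWS = 18_000_000  # "18000k" rows, as stated in the question


def dense_one_hot_bytes(n_rows, n_categories, bytes_per_cell=1):
    """Size of a dense one-hot matrix: one cell per (row, category) pair."""
    return n_rows * n_categories * bytes_per_cell


def nonzero_entries(n_rows):
    """Each one-hot row holds exactly one 1, so a sparse matrix stores n_rows values."""
    return n_rows


for name, n_categories in [("product category", 145), ("city", 3349)]:
    dense_gb = dense_one_hot_bytes(ROWS, n_categories) / 1e9
    print(
        f"{name}: {n_categories} columns, ~{dense_gb:.1f} GB dense, "
        f"{nonzero_entries(ROWS):,} non-zero entries if stored sparsely"
    )
```

Under these assumptions the 145-value column needs roughly 2.6 GB as a dense matrix and the 3,349-value column roughly 60.3 GB, while a sparse representation stores only one non-zero entry per row in both cases. So there is no hard maximum number of unique values; the practical limit depends on memory, on whether sparse storage is used, and on how well the model copes with thousands of mostly-zero inputs.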



Embeddings map each feature value to a dense 1D vector, so that the model can learn that NYC, Paris, and London are similar cities in one aspect (size) and very different in other aspects. So, instead of ~3k one-hot feature columns, the model will have ~50 columns of vector representation.
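As a minimal sketch of what an embedding lookup does, the NumPy snippet below represents each city by an integer index and looks up the matching row in an embedding table. The city list, the dimension of 4 (rather than the ~50 suggested above), and the random weights are assumptions for illustration; in a real model the table is a trainable layer and its values are learned during training.

```python
import numpy as np

# Hypothetical sample of the city column; the real column has ~3k values.
cities = ["NYC", "Paris", "London", "Lagos"]
city_to_index = {city: i for i, city in enumerate(cities)}

embedding_dim = 4  # illustrative only
rng = np.random.default_rng(0)
# One row per category. Random here; learned by the network in practice.
embedding_table = rng.normal(size=(len(cities), embedding_dim))


def embed(column):
    """Map a sequence of city names to their embedding vectors."""
    indices = np.array([city_to_index[c] for c in column])
    return embedding_table[indices]  # shape: (len(column), embedding_dim)


def one_hot(column):
    """Dense one-hot encoding of the same column, for comparison."""
    indices = np.array([city_to_index[c] for c in column])
    return np.eye(len(cities))[indices]  # shape: (len(column), len(cities))


batch = ["Paris", "NYC", "Paris"]
vectors = embed(batch)
print(vectors.shape)  # (3, 4)

# Looking up a row gives the same result as multiplying the one-hot vector by the table.
assert np.allclose(one_hot(batch) @ embedding_table, vectors)
```

The final check shows why an embedding can stand in for a wide one-hot input: it is equivalent to a one-hot vector followed by a linear layer, but the model only ever stores and reads the narrow table.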



Articles that explain embeddings:

• An Overview of Categorical Input Handling for Neural Networks
• On learning embeddings for categorical data using Keras
• Google Developers > Machine Learning > Embeddings: Categorical Input Data
• Exploring Embeddings for Categorical Variables with Keras by Florian Teschner







answered Apr 8 at 12:05 by Shamit Verma (edited Apr 8 at 15:10)


























