How to handle columns with categorical data and many unique values















I have a categorical column with 3349 unique values, representing cities of the world, in a dataset of 18000k rows.

I also have another column, product category, with 145 unique values that I could use in my model.

Can I apply one-hot encoding to these columns, or is there a problem with that approach? Is there a maximum number of unique values beyond which one-hot encoding becomes a problem?

If I should use a different encoding, can you point me in the right direction?







      machine-learning data categorical-data encoding
















asked Apr 8 at 11:04 by dungeon




















          1 Answer



For categorical columns, you have two main options:

1. Entity embeddings
2. One-hot encoding

For the column with 145 values, I would use one-hot encoding; for the column with ~3k values, I would use an embedding. This decision might change depending on the overall number of features.
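As a rough illustration of why one-hot encoding is comfortable at 145 categories but gets heavy at ~3k, here is a minimal sketch in plain Python. The category names and the helpers `one_hot` and `dense_gigabytes` are made up for the example; the row and category counts are the ones from the question, and the 4 bytes per cell assumes a dense float32 matrix.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 list with a single 1."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical product categories, only for illustration.
products = ["books", "toys", "books", "garden"]
categories, encoded = one_hot(products)
print(categories)  # ['books', 'garden', 'toys']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]


def dense_gigabytes(n_rows, n_categories, bytes_per_cell=4):
    """Size of a dense one-hot matrix in GB (float32 cells by default)."""
    return n_rows * n_categories * bytes_per_cell / 1e9


n_rows = 18_000_000  # the "18000k" rows from the question
print(round(dense_gigabytes(n_rows, 145), 1))   # roughly 10 GB
print(round(dense_gigabytes(n_rows, 3349), 1))  # roughly 241 GB
```

Each one-hot row holds exactly one 1, so a sparse format only needs to store one entry per row. That is why a sparse one-hot matrix can stay manageable even when the dense size does not (scikit-learn's `OneHotEncoder`, for instance, returns a sparse matrix by default).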



An embedding maps each feature value to a dense, low-dimensional vector, so that the model can learn that NYC, Paris, and London are similar cities in one aspect (size) and very different in other aspects. So, instead of ~3k one-hot columns, the model gets ~50 columns of vector representation.
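To make the ~3k columns versus ~50 columns comparison concrete, the sketch below builds an embedding table in plain Python. It is only an illustration under stated assumptions: the vectors here are random, whereas in a real model (for example a Keras `Embedding` layer) they are learned during training. The placeholder city names, `EMBEDDING_DIM = 50`, and the helpers `embed` and `one_hot_vector` are hypothetical.

```python
import random

random.seed(0)

N_CITIES = 3349     # unique cities, from the question
EMBEDDING_DIM = 50  # the "~50 columns" from the answer; a tunable choice

# Hypothetical vocabulary: a few named cities plus placeholders for the rest.
cities = ["NYC", "Paris", "London"] + [f"city_{i}" for i in range(3, N_CITIES)]
city_to_index = {name: i for i, name in enumerate(cities)}

# One row of EMBEDDING_DIM numbers per city. Random here; a neural network
# would adjust these values during training.
embedding_table = [
    [random.gauss(0.0, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(N_CITIES)
]


def embed(city):
    """Look up the dense vector for a city name."""
    return embedding_table[city_to_index[city]]


def one_hot_vector(city):
    """The equivalent one-hot vector, for comparison."""
    vec = [0] * N_CITIES
    vec[city_to_index[city]] = 1
    return vec


print(len(one_hot_vector("Paris")))  # 3349 input columns with one-hot encoding
print(len(embed("Paris")))           # 50 input columns with an embedding
```

Looking up row i of the table gives the same result as multiplying the one-hot vector by the table, which is why an embedding can be read as a learned, compressed replacement for the one-hot input.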



Articles that explain embeddings:

• An Overview of Categorical Input Handling for Neural Networks
• On learning embeddings for categorical data using Keras
• Google Developers > Machine Learning > Embeddings: Categorical Input Data
• Exploring Embeddings for Categorical Variables with Keras by Florian Teschner















answered Apr 8 at 12:05 by Shamit Verma (edited Apr 8 at 15:10)


























