Wordcount in a field (all and unique) - is there a more elegant/optimal way?Efficient merging (removing duplicates) of arraysHow do I select arrays that are not empty?How can I get the page size of a Postgres database?How to find the count of words (all and unique) in a column row-wise?Is it ever a good idea to denormalize for integrity?Which schema is better for a shopping project?PostgreSQL Primary key disappears from test tableBulk insert M:N relation in PostgreSQLWhat indexing to be used over string columns(text type) with string length around 3000 charactersNon-integer primary key considerationsIs there a better way than this to split a string and get the first substring?Help with design of sales databaseDatabase design for varying amount of relationsPostgreSQL performance with (col = value or col is NULL)

Can any NP-Complete Problem be solved using at most polynomial space (but while using exponential time?)

How do I set an alias to a terminal line?

Why cruise at 7000' in an A319?

Vanishing of certain coefficients coming from Coxeter groups

Why is the voltage measurement of this circuit different when the switch is on?

Is adding a new player (or players) a DM decision, or a group decision?

How to make clear to people I don't want to answer their "Where are you from?" question?

Can humans ever directly see a few photons at a time? Can a human see a single photon?

Iterate MapThread with matrices

A STL-like vector implementation in C++

Intuition for capacitors in series

Can Ogre clerics use Purify Food and Drink on humanoid characters?

Should developer taking test phones home or put in office?

Is my Rep in Stack-Exchange Form?

Is it possible writing coservation of relativistic energy in this naive way?

“D’entre eux” to mean “of them”

What are the penalties for overstaying in USA?

How do I respond to requests for a "guarantee" not to leave after a few months?

Sci fi short story, robot city that nags people about health

Apply brace expansion in "reverse order"

Source for "the land of Israel makes its inhabitants wise"

Can the negators "jamais, rien, personne, plus, ni, aucun" be used in a single sentence?

How would a drone work in centrifugal force generated "gravity"?

Swapping rooks in a 4x4 board



Wordcount in a field (all and unique) - is there a more elegant/optimal way?


Efficient merging (removing duplicates) of arraysHow do I select arrays that are not empty?How can I get the page size of a Postgres database?How to find the count of words (all and unique) in a column row-wise?Is it ever a good idea to denormalize for integrity?Which schema is better for a shopping project?PostgreSQL Primary key disappears from test tableBulk insert M:N relation in PostgreSQLWhat indexing to be used over string columns(text type) with string length around 3000 charactersNon-integer primary key considerationsIs there a better way than this to split a string and get the first substring?Help with design of sales databaseDatabase design for varying amount of relationsPostgreSQL performance with (col = value or col is NULL)






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








4















Answering this question,



Given this table (constructed from the question):



CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

INSERT INTO wordcount (description) VALUES ('What a great day');
INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');


produce this result:



 id | word_count | unique_word_count | Description 
---------+------------+-------------------+---------------
1 | 4 | 4 | What a great day
2 | 7 | 6 | This is a product. It is useful


I gave the (correct) answer, which you can find here.



However in a comment, the OP then asked a further question - what if the string in question was ['a', ' ', ' ', 'b'] and my solution broke down completely - for starters, the string wouldn't even INSERT into the table.



So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.



Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailled explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.



I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.










share|improve this question

















This question has an open bounty worth +100
reputation from Vérace ending ending at 2019-06-21 13:13:16Z">in 6 hours.


Looking for an answer drawing from credible and/or official sources.


The question itself contains the requirements for the bounty.






















    4















    Answering this question,



    Given this table (constructed from the question):



    CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

    INSERT INTO wordcount (description) VALUES ('What a great day');
    INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');


    produce this result:



     id | word_count | unique_word_count | Description 
    ---------+------------+-------------------+---------------
    1 | 4 | 4 | What a great day
    2 | 7 | 6 | This is a product. It is useful


    I gave the (correct) answer, which you can find here.



    However in a comment, the OP then asked a further question - what if the string in question was ['a', ' ', ' ', 'b'] and my solution broke down completely - for starters, the string wouldn't even INSERT into the table.



    So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.



    Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailled explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.



    I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.










    share|improve this question

















    This question has an open bounty worth +100
    reputation from Vérace ending ending at 2019-06-21 13:13:16Z">in 6 hours.


    Looking for an answer drawing from credible and/or official sources.


    The question itself contains the requirements for the bounty.


















      4












      4








      4








      Answering this question,



      Given this table (constructed from the question):



      CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

      INSERT INTO wordcount (description) VALUES ('What a great day');
      INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');


      produce this result:



       id | word_count | unique_word_count | Description 
      ---------+------------+-------------------+---------------
      1 | 4 | 4 | What a great day
      2 | 7 | 6 | This is a product. It is useful


      I gave the (correct) answer, which you can find here.



      However in a comment, the OP then asked a further question - what if the string in question was ['a', ' ', ' ', 'b'] and my solution broke down completely - for starters, the string wouldn't even INSERT into the table.



      So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.



      Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailled explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.



      I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.










      share|improve this question
















      Answering this question,



      Given this table (constructed from the question):



      CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

      INSERT INTO wordcount (description) VALUES ('What a great day');
      INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');


      produce this result:



       id | word_count | unique_word_count | Description 
      ---------+------------+-------------------+---------------
      1 | 4 | 4 | What a great day
      2 | 7 | 6 | This is a product. It is useful


      I gave the (correct) answer, which you can find here.



      However in a comment, the OP then asked a further question - what if the string in question was ['a', ' ', ' ', 'b'] and my solution broke down completely - for starters, the string wouldn't even INSERT into the table.



      So, the question now is, how does one deal with strings like this - i.e. with apostrophes, square brackets &c. I'm going to give my own answer and also offer a bonus for a more elegant solution.



      Solutions with multiple options will be highly prized as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailled explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.



      I can only offer a bonus in two days - so I'll post my answer and will offer the bonus (+100) when I'm allowed to. Also, any solutions which deal with strings which my own can't deal with - haven't exhaustively tested yet.







      postgresql string-manipulation






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Jun 14 at 13:38







      Vérace

















      asked Jun 6 at 12:07









      VéraceVérace

      16.9k3 gold badges36 silver badges54 bronze badges




      16.9k3 gold badges36 silver badges54 bronze badges






      This question has an open bounty worth +100
      reputation from Vérace ending ending at 2019-06-21 13:13:16Z">in 6 hours.


      Looking for an answer drawing from credible and/or official sources.


      The question itself contains the requirements for the bounty.








      This question has an open bounty worth +100
      reputation from Vérace ending ending at 2019-06-21 13:13:16Z">in 6 hours.


      Looking for an answer drawing from credible and/or official sources.


      The question itself contains the requirements for the bounty.






















          2 Answers
          2






          active

          oldest

          votes


















          4














          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)





          share|improve this answer

























          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53


















          0














          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



          SELECT id
          , COALESCE(cardinality(arr), 0) AS word_count
          , unique_word_count
          , description
          FROM (
          SELECT *
          , string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ') AS arr
          FROM wordcount
          ) a
          LEFT JOIN LATERAL (
          SELECT count(DISTINCT elem) AS unique_word_count
          FROM unnest(arr) elem
          ) b ON true;


          db<>fiddle here (extended test case)



          The core is regexp_replace(description, 'W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.



          Or (faster in a quick test with short strings):



          SELECT id
          , count(*) AS word_count
          , count(DISTINCT elem) AS unique_word_count
          , description
          FROM (
          SELECT id, description
          , unnest(string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ')) AS elem
          FROM wordcount
          ) sub
          GROUP BY id, description;


          This drops rows with 0 words like your answer does.



          (Ab-)using text search parser



          Using the text search function ts_parse() is simpler. May or may not be faster. But first study the various tokens identified by the the text search parser and see what matches your definition of "word":



          SELECT * FROM ts_token_type('default')


          For only "ASCII Words":
          (Unlike above, the underscore (_) is not treated as word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          To keep _ from separating words, use simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?





          share|improve this answer

























          • Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

            – Vérace
            Jun 7 at 15:56











          • For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

            – Erwin Brandstetter
            Jun 7 at 16:19













          Your Answer








          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "182"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f239958%2fwordcount-in-a-field-all-and-unique-is-there-a-more-elegant-optimal-way%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          4














          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)





          share|improve this answer

























          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53















          4














          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)





          share|improve this answer

























          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53













          4












          4








          4







          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)





          share|improve this answer















          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jun 7 at 14:10

























          answered Jun 6 at 12:17









          VéraceVérace

          16.9k3 gold badges36 silver badges54 bronze badges




          16.9k3 gold badges36 silver badges54 bronze badges












          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53

















          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53
















          jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

          – Arkhena
          Jun 6 at 12:51





          jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

          – Arkhena
          Jun 6 at 12:51













          @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

          – Vérace
          Jun 6 at 12:53





          @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

          – Vérace
          Jun 6 at 12:53













          0














          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



          SELECT id
          , COALESCE(cardinality(arr), 0) AS word_count
          , unique_word_count
          , description
          FROM (
          SELECT *
          , string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ') AS arr
          FROM wordcount
          ) a
          LEFT JOIN LATERAL (
          SELECT count(DISTINCT elem) AS unique_word_count
          FROM unnest(arr) elem
          ) b ON true;


          db<>fiddle here (extended test case)



          The core is regexp_replace(description, 'W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.



          Or (faster in a quick test with short strings):



          SELECT id
          , count(*) AS word_count
          , count(DISTINCT elem) AS unique_word_count
          , description
          FROM (
          SELECT id, description
          , unnest(string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ')) AS elem
          FROM wordcount
          ) sub
          GROUP BY id, description;


          This drops rows with 0 words like your answer does.



          (Ab-)using text search parser



          Using the text search function ts_parse() is simpler. May or may not be faster. But first study the various tokens identified by the the text search parser and see what matches your definition of "word":



          SELECT * FROM ts_token_type('default')


          For only "ASCII Words":
          (Unlike above, the underscore (_) is not treated as word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          To keep _ from separating words, use simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?





          share|improve this answer

























          • Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

            – Vérace
            Jun 7 at 15:56











          • For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

            – Erwin Brandstetter
            Jun 7 at 16:19















          0














          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



          SELECT id
          , COALESCE(cardinality(arr), 0) AS word_count
          , unique_word_count
          , description
          FROM (
          SELECT *
          , string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ') AS arr
          FROM wordcount
          ) a
          LEFT JOIN LATERAL (
          SELECT count(DISTINCT elem) AS unique_word_count
          FROM unnest(arr) elem
          ) b ON true;


          db<>fiddle here (extended test case)



          The core is regexp_replace(description, 'W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.



          Or (faster in a quick test with short strings):



          SELECT id
          , count(*) AS word_count
          , count(DISTINCT elem) AS unique_word_count
          , description
          FROM (
          SELECT id, description
          , unnest(string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ')) AS elem
          FROM wordcount
          ) sub
          GROUP BY id, description;


          This drops rows with 0 words like your answer does.



          (Ab-)using text search parser



          Using the text search function ts_parse() is simpler. May or may not be faster. But first study the various tokens identified by the the text search parser and see what matches your definition of "word":



          SELECT * FROM ts_token_type('default')


          For only "ASCII Words":
          (Unlike above, the underscore (_) is not treated as word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          To keep _ from separating words, use simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?





          share|improve this answer

























          • Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

            – Vérace
            Jun 7 at 15:56











          • For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

            – Erwin Brandstetter
            Jun 7 at 16:19













          0












          0








          0







          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



          SELECT id
          , COALESCE(cardinality(arr), 0) AS word_count
          , unique_word_count
          , description
          FROM (
          SELECT *
          , string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ') AS arr
          FROM wordcount
          ) a
          LEFT JOIN LATERAL (
          SELECT count(DISTINCT elem) AS unique_word_count
          FROM unnest(arr) elem
          ) b ON true;


          db<>fiddle here (extended test case)



          The core is regexp_replace(description, 'W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.



          Or (faster in a quick test with short strings):



          SELECT id
          , count(*) AS word_count
          , count(DISTINCT elem) AS unique_word_count
          , description
          FROM (
          SELECT id, description
          , unnest(string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ')) AS elem
          FROM wordcount
          ) sub
          GROUP BY id, description;


          This drops rows with 0 words like your answer does.



          (Ab-)using text search parser



          Using the text search function ts_parse() is simpler. May or may not be faster. But first study the various tokens identified by the the text search parser and see what matches your definition of "word":



          SELECT * FROM ts_token_type('default')


          For only "ASCII Words":
          (Unlike above, the underscore (_) is not treated as word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          To keep _ from separating words, use simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?





          share|improve this answer















          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



          SELECT id
          , COALESCE(cardinality(arr), 0) AS word_count
          , unique_word_count
          , description
          FROM (
          SELECT *
          , string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ') AS arr
          FROM wordcount
          ) a
          LEFT JOIN LATERAL (
          SELECT count(DISTINCT elem) AS unique_word_count
          FROM unnest(arr) elem
          ) b ON true;


          db<>fiddle here (extended test case)



          The core is regexp_replace(description, 'W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.



          Or (faster in a quick test with short strings):



          SELECT id
          , count(*) AS word_count
          , count(DISTINCT elem) AS unique_word_count
          , description
          FROM (
          SELECT id, description
          , unnest(string_to_array(trim(regexp_replace(description, 'W+', ' ', 'g')), ' ')) AS elem
          FROM wordcount
          ) sub
          GROUP BY id, description;


          This drops rows with 0 words like your answer does.



          (Ab-)using text search parser



          Using the text search function ts_parse() is simpler. May or may not be faster. But first study the various tokens identified by the the text search parser and see what matches your definition of "word":



          SELECT * FROM ts_token_type('default')


          For only "ASCII Words":
          (Unlike above, the underscore (_) is not treated as word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          To keep _ from separating words, use simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jun 7 at 16:14

























          answered Jun 7 at 3:55









          Erwin BrandstetterErwin Brandstetter

          98.5k9 gold badges203 silver badges325 bronze badges




          98.5k9 gold badges203 silver badges325 bronze badges












          • Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

            – Vérace
            Jun 7 at 15:56











          • For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

            – Erwin Brandstetter
            Jun 7 at 16:19

















          • Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

            – Vérace
            Jun 7 at 15:56











          • For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

            – Erwin Brandstetter
            Jun 7 at 16:19
















          Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

          – Vérace
          Jun 7 at 15:56





          Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

          – Vérace
          Jun 7 at 15:56













          For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

          – Erwin Brandstetter
          Jun 7 at 16:19





          For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score ,under and score - 3 "words". I also clarified ts_parse()which knows various kinds of "words". I added some more above to clarify.

          – Erwin Brandstetter
          Jun 7 at 16:19

















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Database Administrators Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f239958%2fwordcount-in-a-field-all-and-unique-is-there-a-more-elegant-optimal-way%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Club Baloncesto Breogán Índice Historia | Pavillón | Nome | O Breogán na cultura popular | Xogadores | Adestradores | Presidentes | Palmarés | Historial | Líderes | Notas | Véxase tamén | Menú de navegacióncbbreogan.galCadroGuía oficial da ACB 2009-10, páxina 201Guía oficial ACB 1992, páxina 183. Editorial DB.É de 6.500 espectadores sentados axeitándose á última normativa"Estudiantes Junior, entre as mellores canteiras"o orixinalHemeroteca El Mundo Deportivo, 16 setembro de 1970, páxina 12Historia do BreogánAlfredo Pérez, o último canoneiroHistoria C.B. BreogánHemeroteca de El Mundo DeportivoJimmy Wright, norteamericano do Breogán deixará Lugo por ameazas de morteResultados de Breogán en 1986-87Resultados de Breogán en 1990-91Ficha de Velimir Perasović en acb.comResultados de Breogán en 1994-95Breogán arrasa al Barça. "El Mundo Deportivo", 27 de setembro de 1999, páxina 58CB Breogán - FC BarcelonaA FEB invita a participar nunha nova Liga EuropeaCharlie Bell na prensa estatalMáximos anotadores 2005Tempada 2005-06 : Tódolos Xogadores da Xornada""Non quero pensar nunha man negra, mais pregúntome que está a pasar""o orixinalRaúl López, orgulloso dos xogadores, presume da boa saúde económica do BreogánJulio González confirma que cesa como presidente del BreogánHomenaxe a Lisardo GómezA tempada do rexurdimento celesteEntrevista a Lisardo GómezEl COB dinamita el Pazo para forzar el quinto (69-73)Cafés Candelas, patrocinador del CB Breogán"Suso Lázare, novo presidente do Breogán"o orixinalCafés Candelas Breogán firma el mayor triunfo de la historiaEl Breogán realizará 17 homenajes por su cincuenta aniversario"O Breogán honra ao seu fundador e primeiro presidente"o orixinalMiguel Giao recibiu a homenaxe do PazoHomenaxe aos primeiros gladiadores celestesO home que nos amosa como ver o Breo co corazónTita Franco será homenaxeada polos #50anosdeBreoJulio Vila recibirá unha homenaxe in memoriam polos #50anosdeBreo"O Breogán homenaxeará aos seus aboados máis veteráns"Pechada ovación a «Capi» Sanmartín e Ricardo «Corazón de González»Homenaxe por décadas de informaciónPaco García volve ao Pazo con motivo do 50 aniversario"Resultados y clasificaciones""O Cafés Candelas Breogán, campión da Copa Princesa""O Cafés Candelas Breogán, equipo ACB"C.B. Breogán"Proxecto social"o orixinal"Centros asociados"o orixinalFicha en imdb.comMario Camus trata la recuperación del amor en 'La vieja música', su última película"Páxina web oficial""Club Baloncesto Breogán""C. B. Breogán S.A.D."eehttp://www.fegaba.com

          Vilaño, A Laracha Índice Patrimonio | Lugares e parroquias | Véxase tamén | Menú de navegación43°14′52″N 8°36′03″O / 43.24775, -8.60070

          Cegueira Índice Epidemioloxía | Deficiencia visual | Tipos de cegueira | Principais causas de cegueira | Tratamento | Técnicas de adaptación e axudas | Vida dos cegos | Primeiros auxilios | Crenzas respecto das persoas cegas | Crenzas das persoas cegas | O neno deficiente visual | Aspectos psicolóxicos da cegueira | Notas | Véxase tamén | Menú de navegación54.054.154.436928256blindnessDicionario da Real Academia GalegaPortal das Palabras"International Standards: Visual Standards — Aspects and Ranges of Vision Loss with Emphasis on Population Surveys.""Visual impairment and blindness""Presentan un plan para previr a cegueira"o orixinalACCDV Associació Catalana de Cecs i Disminuïts Visuals - PMFTrachoma"Effect of gene therapy on visual function in Leber's congenital amaurosis"1844137110.1056/NEJMoa0802268Cans guía - os mellores amigos dos cegosArquivadoEscola de cans guía para cegos en Mortágua, PortugalArquivado"Tecnología para ciegos y deficientes visuales. Recopilación de recursos gratuitos en la Red""Colorino""‘COL.diesis’, escuchar los sonidos del color""COL.diesis: Transforming Colour into Melody and Implementing the Result in a Colour Sensor Device"o orixinal"Sistema de desarrollo de sinestesia color-sonido para invidentes utilizando un protocolo de audio""Enseñanza táctil - geometría y color. Juegos didácticos para niños ciegos y videntes""Sistema Constanz"L'ocupació laboral dels cecs a l'Estat espanyol està pràcticament equiparada a la de les persones amb visió, entrevista amb Pedro ZuritaONCE (Organización Nacional de Cegos de España)Prevención da cegueiraDescrición de deficiencias visuais (Disc@pnet)Braillín, un boneco atractivo para calquera neno, con ou sen discapacidade, que permite familiarizarse co sistema de escritura e lectura brailleAxudas Técnicas36838ID00897494007150-90057129528256DOID:1432HP:0000618D001766C10.597.751.941.162C97109C0155020