
Wordcount in a field (all and unique) - is there a more elegant/optimal way?


Answering this question,



Given this table (constructed from the question):



CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

INSERT INTO wordcount (description) VALUES ('What a great day');
INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');


produce this result:



 id | word_count | unique_word_count | description
----+------------+-------------------+---------------------------------
  1 |          4 |                 4 | What a great day
  2 |          7 |                 6 | This is a product. It is useful


I gave the (correct) answer, which you can find here.



However, in a comment, the OP then asked a further question: what if the string were ['a', ' ', ' ', 'b']? My solution broke down completely - for starters, the string wouldn't even INSERT into the table.



So, the question now is: how does one deal with strings like this, i.e. with apostrophes, square brackets &c.? I'm going to give my own answer and also offer a bonus for a more elegant solution.



Solutions with multiple options will be highly prized, as will those which show evidence of "thinking outside the box" (sorry for the cliché - but it fits here! :-) ). I'm also going to give a detailed explanation of my reasoning - that will earn kudos also! Options which mention other servers will also gain merit. Obviously, I can only award the bonus to one person, but I will upvote all decent answers.



I can only offer a bonus in two days - so I'll post my answer and award the bonus (+100) when I'm allowed to. Also prized: any solutions which deal with strings that my own can't (I haven't exhaustively tested yet).



























This question has an open bounty worth +100 reputation from Vérace, ending 2019-06-21 13:13:16Z (in 6 hours).


Looking for an answer drawing from credible and/or official sources.


The question itself contains the requirements for the bounty.






















      postgresql string-manipulation







edited Jun 14 at 13:38
asked Jun 6 at 12:07 by Vérace (16.9k rep; 3 gold, 36 silver, 54 bronze badges)






          2 Answers
































The first step, obviously, is to create the table and data (as per the question above):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


The first "lifesaver" was dollar quoting ($$) - a really neat
PostgreSQL feature. I was really floundering before I came across
it - I couldn't even get the data into the table (trying
backslashes, double quotes &c.).



          My final SQL looks like this (fiddle here):



WITH cte1 AS
(
    SELECT id,
           UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
               REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) AS "word",
           description
    FROM wordcount
)
SELECT id,
       COUNT(word),
       COUNT(DISTINCT word),
       description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;


          Result:



 id | word_count | distinct_count | description
----+------------+----------------+---------------------------------
  1 |          4 |              4 | What a great day
  2 |          7 |              6 | This is a product. It is useful
  3 |          2 |              2 | ['a', ' ', ' ', 'b']





          Logic explained:



I decided not to bother with capitalisation - i.e. "It" and "it" are treated as different words here. If this is an issue, the simple addition of an UPPER() function would solve it - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



 id | regexp_split_to_table
----+---------------------------------
  1 | What a great day
  2 | This is a product. It is useful
  3 | ['a'
  3 |  ' '
  3 |  ' '
  3 |  'b']


          Step 2 (remove all non-space, non-alpha)



SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Z\s]', '', 'g')
FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



 id | regexp_replace
----+---------------------------------
  1 | What a great day
  2 | This is a product It is useful
  3 | a
  3 |
  3 |
  3 |  b


          Step 3 (put the strings into an array):



SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
           REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.


          Result:



 id | string_to_array
----+---------------------------------
  1 | What,a,great,day
  2 | This,is,a,product,It,is,useful
  3 | a
  3 | "","",""
  3 | "","",""
  3 | "",b


Finally, the answer itself: UNNEST, then select only those words with LENGTH > 0, grouping by id and description.



i.e. SELECT the necessary fields from the following CTE (Common Table Expression). The CTE isn't strictly necessary - I could have repeated the UNNEST... expression throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



WITH cte1 AS
(
    SELECT id,
           UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
               REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) AS "word",
           description
    FROM wordcount
)
SELECT blah... (see above)
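For readers more at home outside SQL, the same pipeline can be sketched in Python. This is an illustrative analogue of the steps above, not the answer's actual code; note that Python's `\w`, like Postgres's, also matches digits and the underscore:

```python
import re

def word_counts(description):
    r"""Mirror of the SQL pipeline: split on commas, strip every character
    that is not a word character or whitespace (like REGEXP_REPLACE with
    '[^\w\s]'), split on spaces, and drop empty strings (LENGTH(word) > 0)."""
    words = []
    for chunk in description.split(','):           # REGEXP_SPLIT_TO_TABLE(description, ',')
        cleaned = re.sub(r'[^\w\s]', '', chunk)    # strip punctuation, quotes, brackets
        words.extend(w for w in cleaned.split(' ') if w)
    return len(words), len(set(words))
```

Running this over the three test rows reproduces the counts in the result table above, including (2, 2) for the awkward ['a', ' ', ' ', 'b'] string.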





• Just a simple addition: with Postgres 12, you'll have generated columns that will allow you to store that information (see postgresql.org/docs/devel/ddl-generated-columns.html).

  – Arkhena
  Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53
































          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



SELECT id
     , COALESCE(cardinality(arr), 0) AS word_count
     , unique_word_count
     , description
FROM  (
    SELECT *
         , string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ') AS arr
    FROM   wordcount
    ) a
LEFT JOIN LATERAL (
    SELECT count(DISTINCT elem) AS unique_word_count
    FROM   unnest(arr) elem
    ) b ON true;


          db<>fiddle here (extended test case)



The core is regexp_replace(description, '\W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-Shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.
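A compact Python analogue of this normalize-then-split approach (an illustrative sketch, not part of the answer's SQL) shows the same behaviour, including the (0, 0) result for NULL and word-less strings:

```python
import re

def word_counts_normalized(description):
    r"""Sketch of the regexp_replace('\W+', ' ') approach: collapse every run
    of non-word characters to one space, trim, then split on single spaces.
    None maps to (0, 0), like the COALESCE in the outer SELECT."""
    if description is None:
        return 0, 0
    normalized = re.sub(r'\W+', ' ', description).strip()
    words = normalized.split(' ') if normalized else []
    return len(words), len(set(words))
```

Unlike the drop-the-row behaviour discussed above, inputs such as '' or '-' come back as a count of 0 rather than disappearing.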



          Or (faster in a quick test with short strings):



SELECT id
     , count(*) AS word_count
     , count(DISTINCT elem) AS unique_word_count
     , description
FROM  (
    SELECT id, description
         , unnest(string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ')) AS elem
    FROM   wordcount
    ) sub
GROUP BY id, description;


          This drops rows with 0 words like your answer does.



          (Ab-)using text search parser



Using the text search function ts_parse() is simpler. It may or may not be faster. But first study the various tokens identified by the text search parser and see what matches your definition of "word":



          SELECT * FROM ts_token_type('default')


For only "ASCII words"
(unlike above, the underscore (_) is not treated as a word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;
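As a rough Python analogue of keeping only the 'asciiword' tokens (an assumption for illustration: an "ASCII word" here is a maximal run of ASCII letters, which is close to but not exactly the text search parser's tokenization):

```python
import re

def asciiword_counts(description):
    """Approximate the 'asciiword' filter: tokens are maximal runs of ASCII
    letters, so the underscore and the dash both split words."""
    tokens = re.findall(r'[A-Za-z]+', description)
    return len(tokens), len(set(tokens))
```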


          To keep _ from separating words, use simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?





• Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing: if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

  – Vérace
  Jun 7 at 15:56











• For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score, under and score - 3 "words". I also clarified ts_parse(), which knows various kinds of "words". I added some more above to clarify.

  – Erwin Brandstetter
  Jun 7 at 16:19













          Your Answer








          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "182"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f239958%2fwordcount-in-a-field-all-and-unique-is-there-a-more-elegant-optimal-way%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          4














          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)





          share|improve this answer

























          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53















          4














          First step obviously is to create the table and data (as per the question mentioned):



          CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

          INSERT INTO wordcount (description) VALUES ($$What a great day$$);
          INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
          INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);


          The first "lifesaver" was the dollar quoting ($$) - a really neat
          PostgreSQL feature. I was really floundering before I came across
          this - couldn't even get the data into the table (trying
          backslashes, double quotes &c.)



          My final SQL looks like this (fiddle here):



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT id,
          COUNT(word),
          COUNT(DISTINCT(word)),
          description
          FROM cte1
          WHERE LENGTH(word) > 0
          GROUP BY id, description
          ORDER BY id;


          Result:



          id Word_count Distinct_count description
          1 4 4 What a great day
          2 7 6 This is a product. It is useful
          3 2 2 ['a', ' ', ' ', 'b']





          Logic explained:



          I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.



          Step 1:



          SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
          -- Keeping the id field helps clarity, even if superfluous.


          Result:



          id regexp_split_to_table
          1 What a great day
          2 This is a product. It is useful
          3 ['a'
          3 ' '
          3 ' '
          3 'b']


          Step 2 (remove all non-space, non-alpha)



          SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
          FROM wordcount;

          -- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
          -- be counted as different! Again, keeping the id field makes things clearer,
          -- even if not strictly necessary for purists


          Result:



          id regexp_replace
          1 What a great day
          2 This is a product It is useful
          3 a
          3
          3
          3 b


          Step 3 (put the strings into an array):



          SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
          FROM wordcount;
          -- id again - not strictly necessary at this step.


          Result:



          id string_to_array
          1 What,a,great,day
          2 This,is,a,product,It,is,useful
          3 a
          3 "","",""
          3 "","",""
          3 "",b


          Finally, the answer itself - UNNEST and then select those words LENGTH > 0 grouping by id and description.



          i.e. SELECT the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST... throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!



          WITH cte1 AS
          (
          SELECT id,
          UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
          REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
          description
          FROM wordcount
          )
          SELECT blah... (see above)





          share|improve this answer

























          • jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)

            – Arkhena
            Jun 6 at 12:51











          • @Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)

            – Vérace
            Jun 6 at 12:53













          edited Jun 7 at 14:10

























          answered Jun 6 at 12:17









          Vérace

          16.9k3 gold badges36 silver badges54 bronze badges




          As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.



          Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...



          Regular expression based string processing



          Similar to your solution, with some alternative suggestions:



SELECT id
, COALESCE(cardinality(arr), 0) AS word_count
, unique_word_count
, description
FROM (
SELECT *
, string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ') AS arr
FROM wordcount
) a
LEFT JOIN LATERAL (
SELECT count(DISTINCT elem) AS unique_word_count
FROM unnest(arr) elem
) b ON true;


          db<>fiddle here (extended test case)



          The core is regexp_replace(description, '\W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.



          Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.



          Get word_count from the array directly. Again: cheap.



          The unique_word_count from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.



          The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.
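The same normalisation can be sketched in Python (illustrative only, not part of the SQL itself - Python's `\W` shorthand matches the PostgreSQL one here): collapse every run of non-word characters to a single space, trim, split, count. It also shows the degenerate inputs yielding a count of 0:

```python
import re

def word_counts(description):
    """Collapse runs of non-word characters to one space (like
    regexp_replace(..., '\\W+', ' ', 'g')), trim, split, count."""
    if description is None:                  # NULL input -> 0, as with the COALESCE
        return 0, 0
    tokens = [t for t in re.sub(r"\W+", " ", description).strip().split(" ") if t]
    return len(tokens), len(set(tokens))

print(word_counts("This is a product. It is useful"))  # (7, 6)
print(word_counts('"§$%'))                             # (0, 0)
```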



          Or (faster in a quick test with short strings):



SELECT id
, count(*) AS word_count
, count(DISTINCT elem) AS unique_word_count
, description
FROM (
SELECT id, description
, unnest(string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ')) AS elem
FROM wordcount
) sub
GROUP BY id, description;


          This drops rows with 0 words like your answer does.
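Why the GROUP BY variant loses zero-word rows can be seen in a tiny Python analogue (hypothetical data, same tokenisation as above):

```python
import re

def tokens(s):
    # same normalisation: runs of non-word characters become spaces
    return [t for t in re.sub(r"\W+", " ", s).strip().split(" ") if t]

rows = {1: "What a great day", 2: "-"}
# Grouping over unnested tokens: a row that yields no tokens
# contributes no groups, so it vanishes from the result.
result = {rid: (len(ts), len(set(ts)))
          for rid, ts in ((rid, tokens(d)) for rid, d in rows.items()) if ts}
print(result)  # {1: (4, 4)} - id 2 ('-') produced no result row
```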



          (Ab-)using text search parser



          Using the text search function ts_parse() is simpler. It may or may not be faster. But first study the various tokens identified by the text search parser and see what matches your definition of "word":



SELECT * FROM ts_token_type('default');


          For only "ASCII Words":
          (Unlike above, the underscore (_) is not treated as a word character here):



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          To keep _ from separating words, use a simple replace() first:



          SELECT w.id
          , count(*) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
          WHERE t.tokid = 1 -- 'asciiword'
          GROUP BY w.id;


          Again, to keep all rows:



          SELECT w.id
          , count(token) AS word_count
          , count(DISTINCT token) AS unique_word_count
          , w.description
          FROM wordcount w
          LEFT JOIN LATERAL (
          SELECT t.token
          FROM ts_parse('default', w.description) t
          WHERE t.tokid = 1 -- 'asciiword'
          ) t ON true
          GROUP BY w.id;


          db<>fiddle here



          Related:



          • Efficient merging (removing duplicates) of arrays

          • How do I select arrays that are not empty?

          • What is the difference between LATERAL and a subquery in PostgreSQL?





          • Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.

            – Vérace
            Jun 7 at 15:56











          • For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score, under and score - 3 "words". I also clarified ts_parse(), which knows various kinds of "words". I added some more above to clarify.

            – Erwin Brandstetter
            Jun 7 at 16:19
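The distinction in that comment is easy to check in Python, whose regex `\w` shorthand (like PostgreSQL's) counts the underscore as a word character but not the dash:

```python
import re

# '\W+' treats '-' and ' ' as separators, but leaves '_' inside a word
words = [w for w in re.sub(r"\W+", " ", "under_score under-score").split(" ") if w]
print(words)  # ['under_score', 'under', 'score'] - 3 words, 3 distinct
```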















          edited Jun 7 at 16:14

























          answered Jun 7 at 3:55









          Erwin Brandstetter

          98.5k9 gold badges203 silver badges325 bronze badges




          Thanks for contributing an answer to Database Administrators Stack Exchange!

