Wordcount in a field (all and unique) - is there a more elegant/optimal way?
Answering this question,
Given this table (constructed from the question):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ('What a great day');
INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');
produce this result:
 id | word_count | unique_word_count | description
----+------------+-------------------+---------------------------------
  1 |          4 |                 4 | What a great day
  2 |          7 |                 6 | This is a product. It is useful
I gave the (correct) answer, which you can find here.
However, in a comment, the OP then asked a further question: what if the string in question was ['a', ' ', ' ', 'b']?
My solution broke down completely - for starters, the string wouldn't even INSERT
into the table.
So, the question now is: how does one deal with strings like this, i.e. with apostrophes, square brackets etc.? I'm going to give my own answer and also offer a bounty for a more elegant solution.
Solutions with multiple options will be highly prized, as will those which show evidence of "thinking outside the box" (sorry for the cliché, but it fits here! :-) ). I'm also going to give a detailed explanation of my reasoning - that will earn kudos also! Options which cover other database servers will also gain merit. Obviously, I can only award the bounty to one person, but I will upvote all decent answers.
I can only offer a bounty in two days, so I'll post my answer and will offer the bounty (+100) when I'm allowed to. Bonus points also for any solutions which deal with strings that my own can't handle - I haven't tested exhaustively yet.
postgresql string-manipulation
This question has an open bounty worth +100 reputation from Vérace, ending at 2019-06-21 13:13:16Z (in 6 hours).
Looking for an answer drawing from credible and/or official sources.
The question itself contains the requirements for the bounty.
asked Jun 6 at 12:07, edited Jun 14 at 13:38 by Vérace (16.9k reputation; 3 gold, 36 silver, 54 bronze badges)
2 Answers
First step obviously is to create the table and data (as per the question mentioned):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$) - a really neat PostgreSQL feature. I was really floundering before I came across this: I couldn't even get the data into the table (trying backslashes, double quotes etc.)
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
       UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
       REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
 id | word_count | distinct_count | description
----+------------+----------------+---------------------------------
  1 |          4 |              4 | What a great day
  2 |          7 |              6 | This is a product. It is useful
  3 |          2 |              2 | ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are treated as different words in this case. If this is an issue, the simple addition of an UPPER() function would solve that - it's not core to the question.
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-space, non-alpha)
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Z\s]', '', 'g')
FROM wordcount;
-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer,
-- even if not strictly necessary for purists
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
       REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 What,a,great,day
2 This,is,a,product,It,is,useful
3 a
3 "","",""
3 "","",""
3 "",b
Finally, the answer itself: UNNEST, and then select those words with LENGTH(word) > 0, grouping by id and description.
That is, SELECT what's needed from the following CTE (Common Table Expression). The CTE isn't strictly necessary - I could have used the UNNEST(...) expression throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
       UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
       REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
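The same pipeline can be sketched outside the database. This is a minimal Python approximation (Python's re module handles \w and \s much like PostgreSQL's regex engine; word_counts is a name invented here for illustration):

```python
import re

def word_counts(description):
    """Approximate the SQL pipeline: split on commas, strip punctuation,
    split on spaces, and keep only non-empty words."""
    words = []
    for chunk in description.split(','):               # REGEXP_SPLIT_TO_TABLE(description, ',')
        cleaned = re.sub(r'[^\w\s]', '', chunk)        # REGEXP_REPLACE(..., '[^\w\s]', '', 'g')
        words += [w for w in cleaned.split(' ') if w]  # STRING_TO_ARRAY + UNNEST + LENGTH(word) > 0
    return len(words), len(set(words))

print(word_counts("What a great day"))                 # (4, 4)
print(word_counts("This is a product. It is useful"))  # (7, 6)
print(word_counts("['a', ' ', ' ', 'b']"))             # (2, 2)
```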
Just a simple addition: with Postgres 12, you'll have generated columns, which will allow you to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)
– Arkhena
Jun 6 at 12:51
@Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)
– Vérace
Jun 6 at 12:53
As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0, but your solution drops these rows altogether.
Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...
Regular expression based string processing
Similar to your solution, with some alternative suggestions:
SELECT id
, COALESCE(cardinality(arr), 0) AS word_count
, unique_word_count
, description
FROM (
SELECT *
     , string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ') AS arr
FROM wordcount
) a
LEFT JOIN LATERAL (
SELECT count(DISTINCT elem) AS unique_word_count
FROM unnest(arr) elem
) b ON true;
db<>fiddle here (extended test case)
The core is regexp_replace(description, '\W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.
That's followed up by a cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.
Get word_count from the array directly. Again: cheap.
The unique_word_count comes from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate; it's a bit simpler.
The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.
Or (faster in a quick test with short strings):
SELECT id
, count(*) AS word_count
, count(DISTINCT elem) AS unique_word_count
, description
FROM (
SELECT id, description
     , unnest(string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ')) AS elem
FROM wordcount
) sub
GROUP BY id, description;
This drops rows with 0 words like your answer does.
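To see how this '\W+' normalisation treats the edge cases above, here is a hedged Python sketch of the same logic (regex_word_counts is an invented name, and Python's \W stands in for PostgreSQL's):

```python
import re

def regex_word_counts(description):
    """Collapse runs of non-word characters to one space, trim, then split."""
    if description is None:  # mirrors the COALESCE handling of NULL input
        return 0, 0
    prepared = re.sub(r'\W+', ' ', description).strip()
    words = prepared.split(' ') if prepared else []
    return len(words), len(set(words))

print(regex_word_counts("This is a product. It is useful"))  # (7, 6)
print(regex_word_counts('"§$%'))                             # (0, 0) - no words, but the row survives
print(regex_word_counts("under_score under-score"))          # (3, 3) - '_' is a word character, '-' is not
```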
(Ab-)using text search parser
Using the text search function ts_parse() is simpler, and may or may not be faster. But first study the various tokens identified by the text search parser and see what matches your definition of "word":
SELECT * FROM ts_token_type('default')
For only "ASCII Words" (unlike above, the underscore (_) is not treated as a word character here):
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
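ts_parse() only exists inside PostgreSQL, but the 'asciiword' behaviour can be approximated for comparison. A minimal Python sketch, under the assumption that 'asciiword' tokens are runs of ASCII letters (asciiword_counts is an invented name):

```python
import re

def asciiword_counts(description):
    """Approximate ts_parse 'asciiword' tokens: runs of ASCII letters only,
    so both '_' and '-' separate words (unlike the regex \\w approach)."""
    tokens = re.findall(r'[A-Za-z]+', description)
    return len(tokens), len(set(tokens))

print(asciiword_counts("under_score under-score"))  # (4, 2): under, score, under, score
print(asciiword_counts("What a great day"))         # (4, 4)
```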
To keep _ from separating words, use a simple replace() first:
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
Again, to keep all rows:
SELECT w.id
, count(token) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w
LEFT JOIN LATERAL (
SELECT t.token
FROM ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
) t ON true
GROUP BY w.id;
db<>fiddle here
Related:
- Efficient merging (removing duplicates) of arrays
- How do I select arrays that are not empty?
- What is the difference between LATERAL and a subquery in PostgreSQL?
Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.
– Vérace
Jun 7 at 15:56
For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score, under and score: 3 "words". I also clarified ts_parse(), which knows various kinds of "words". I added some more above to clarify.
– Erwin Brandstetter
Jun 7 at 16:19
First step obviously is to create the table and data (as per the question mentioned):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$
) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
id Word_count Distinct_count description
1 4 4 What a great day
2 7 6 This is a product. It is useful
3 2 2 ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER()
function would solve that - it's not core to the question.
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-space, non-alpha)
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
FROM wordcount;
-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer,
-- even if not strictly necessary for purists
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 What,a,great,day
2 This,is,a,product,It,is,useful
3 a
3 "","",""
3 "","",""
3 "",b
Finally, the answer itself - UNNEST
and then select those words LENGTH > 0
grouping by id and description.
i.e. SELECT
the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST...
throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)
– Arkhena
Jun 6 at 12:51
@Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)
– Vérace
Jun 6 at 12:53
add a comment |
First step obviously is to create the table and data (as per the question mentioned):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$
) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
id Word_count Distinct_count description
1 4 4 What a great day
2 7 6 This is a product. It is useful
3 2 2 ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER()
function would solve that - it's not core to the question.
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-space, non-alpha)
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
FROM wordcount;
-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer,
-- even if not strictly necessary for purists
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 What,a,great,day
2 This,is,a,product,It,is,useful
3 a
3 "","",""
3 "","",""
3 "",b
Finally, the answer itself - UNNEST
and then select those words LENGTH > 0
grouping by id and description.
i.e. SELECT
the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST...
throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
jsute an simple addition: with Postgres 12, you'll have generated columns that will allow to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)
– Arkhena
Jun 6 at 12:51
@Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)
– Vérace
Jun 6 at 12:53
add a comment |
First step obviously is to create the table and data (as per the question mentioned):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$
) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
id Word_count Distinct_count description
1 4 4 What a great day
2 7 6 This is a product. It is useful
3 2 2 ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER()
function would solve that - it's not core to the question.
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-space, non-alpha)
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
FROM wordcount;
-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer,
-- even if not strictly necessary for purists
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 What,a,great,day
2 This,is,a,product,It,is,useful
3 a
3 "","",""
3 "","",""
3 "",b
Finally, the answer itself - UNNEST
and then select those words LENGTH > 0
grouping by id and description.
i.e. SELECT
the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST...
throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
First step obviously is to create the table and data (as per the question mentioned):
CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);
INSERT INTO wordcount (description) VALUES ($$What a great day$$);
INSERT INTO wordcount (description) VALUES ($$This is a product. It is useful$$);
INSERT INTO wordcount (description) VALUES ($$['a', ' ', ' ', 'b']$$);
The first "lifesaver" was the dollar quoting ($$
) - a really neat
PostgreSQL feature. I was really floundering before I came across
this - couldn't even get the data into the table (trying
backslashes, double quotes &c.)
My final SQL looks like this (fiddle here):
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT id,
COUNT(word),
COUNT(DISTINCT(word)),
description
FROM cte1
WHERE LENGTH(word) > 0
GROUP BY id, description
ORDER BY id;
Result:
id Word_count Distinct_count description
1 4 4 What a great day
2 7 6 This is a product. It is useful
3 2 2 ['a', ' ', ' ', 'b']
Logic explained:
I decided not to bother with capitalisation - i.e. "It" and "it" are different words in this case - if this is an issue, the simple addition of an UPPER()
function would solve that - it's not core to the question.
Step 1:
SELECT id, REGEXP_SPLIT_TO_TABLE(description, ',') FROM wordcount;
-- Keeping the id field helps clarity, even if superfluous.
Result:
id regexp_split_to_table
1 What a great day
2 This is a product. It is useful
3 ['a'
3 ' '
3 ' '
3 'b']
Step 2 (remove all non-space, non-alpha)
SELECT id, REGEXP_REPLACE(REGEXP_SPLIT_TO_TABLE(description, ','), '[^a-zA-Zs]', '', 'g')
FROM wordcount;
-- Remove all non-alpha, non-spaces. Otherwise the words "product" and "product." would
-- be counted as different! Again, keeping the id field makes things clearer,
-- even if not strictly necessary for purists
Result:
id regexp_replace
1 What a great day
2 This is a product It is useful
3 a
3
3
3 b
Step 3 (put the strings into an array):
SELECT id, STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^ws]', '', 'g'), ' ')
FROM wordcount;
-- id again - not strictly necessary at this step.
Result:
id string_to_array
1 What,a,great,day
2 This,is,a,product,It,is,useful
3 a
3 "","",""
3 "","",""
3 "",b
Finally, the answer itself - UNNEST
and then select those words LENGTH > 0
grouping by id and description.
i.e. SELECT
the necessary from the following cte (Common Table Expression) - the cte isn't strictly necessary - I could have used the UNNEST...
throughout my final query, but that would have been horrible to read and debug. It's the reason Common Table Expressions were invented!
WITH cte1 AS
(
SELECT id,
UNNEST(STRING_TO_ARRAY(REGEXP_REPLACE(
REGEXP_SPLIT_TO_TABLE(description, ','), '[^\w\s]', '', 'g'), ' ')) as "word",
description
FROM wordcount
)
SELECT blah... (see above)
edited Jun 7 at 14:10
answered Jun 6 at 12:17
Vérace
16.9k3 gold badges36 silver badges54 bronze badges
Just a simple addition: with Postgres 12, you'll have generated columns that will allow you to store that information. (see postgresql.org/docs/devel/ddl-generated-columns.html)
– Arkhena
Jun 6 at 12:51
@Arkhena - thanks for that, but see my first comment here - great minds thinking alike! :-)
– Vérace
Jun 6 at 12:53
As to your solution: smart and with a solid explanation. But what about these cases: '', NULL, '"§$%', '-'? No words. The count should be 0 - but your solution drops these rows altogether.
Also, any solution first and foremost depends on the exact definition of "word", which can vary wildly ...
Regular expression based string processing
Similar to your solution, with some alternative suggestions:
SELECT id
, COALESCE(cardinality(arr), 0) AS word_count
, unique_word_count
, description
FROM (
SELECT *
, string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ') AS arr
FROM wordcount
) a
LEFT JOIN LATERAL (
SELECT count(DISTINCT elem) AS unique_word_count
FROM unnest(arr) elem
) b ON true;
db<>fiddle here (extended test case)
The core is regexp_replace(description, '\W+', ' ', 'g') to replace all substrings of non-word characters with a single space. See Regular Expression Class-shorthand Escapes. This removes all noise early in the game.
Followed up by cheap trim() to remove leading/trailing spaces, and string_to_array() to convert the prepared string to an array.
Get word_count from the array directly. Again: cheap.
The unique_word_count comes from a LATERAL subquery with count(DISTINCT ...). That part may or may not be slower than a total unnest / aggregate. It's a bit simpler.
The COALESCE in the outer SELECT takes care of NULL input (the original question did not mention a NOT NULL constraint). Optional, in case you need 0 instead of NULL.
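The behaviour of this pipeline can be sketched outside the database as well. The following Python emulation (erwin_counts is a hypothetical name; Python's \W is a close but not exact stand-in for the PostgreSQL regex class) shows how "dirty" and NULL rows come out as 0 instead of being dropped:

```python
import re

def erwin_counts(description):
    """Emulate regexp_replace(.., '\\W+', ' ', 'g') -> trim() -> string_to_array()."""
    if description is None:
        return 0, 0                                  # COALESCE(cardinality(arr), 0) for NULL input
    cleaned = re.sub(r"\W+", " ", description).strip()
    arr = cleaned.split(" ") if cleaned else []      # empty string -> empty array, count 0
    return len(arr), len(set(arr))

print(erwin_counts("This is a product. It is useful"))  # (7, 6)
print(erwin_counts('"§$%'))                             # (0, 0) - the row is kept
```

Note that "under_score" stays a single word here, since the underscore counts as a word character.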
Or (faster in a quick test with short strings):
SELECT id
, count(*) AS word_count
, count(DISTINCT elem) AS unique_word_count
, description
FROM (
SELECT id, description
, unnest(string_to_array(trim(regexp_replace(description, '\W+', ' ', 'g')), ' ')) AS elem
FROM wordcount
) sub
GROUP BY id, description;
This drops rows with 0 words like your answer does.
(Ab-)using text search parser
Using the text search function ts_parse() is simpler. May or may not be faster. But first study the various tokens identified by the text search parser and see what matches your definition of "word":
SELECT * FROM ts_token_type('default')
For only "ASCII Words" (unlike above, the underscore (_) is not treated as a word character here):
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
To keep _ from separating words, use simple replace() first:
SELECT w.id
, count(*) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w, ts_parse('default', replace(w.description, '_', 'x')) t
WHERE t.tokid = 1 -- 'asciiword'
GROUP BY w.id;
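For a rough feel of what the asciiword token does, it can be approximated outside the database as maximal runs of ASCII letters. This is an approximation only - the real parser recognises many more token types - and asciiword_counts is a hypothetical name:

```python
import re

def asciiword_counts(description):
    """Approximate ts_parse('default', ...) filtered to tokid = 1 ('asciiword')."""
    tokens = re.findall(r"[A-Za-z]+", description)   # underscore and dash both split words
    return len(tokens), len(set(tokens))

print(asciiword_counts("under_score under-score"))                    # (4, 2)
print(asciiword_counts("under_score under-score".replace("_", "x")))  # (3, 3)
```

The second call shows the effect of the replace('_', 'x') trick above: the underscore no longer splits the first word.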
Again, to keep all rows:
SELECT w.id
, count(token) AS word_count
, count(DISTINCT token) AS unique_word_count
, w.description
FROM wordcount w
LEFT JOIN LATERAL (
SELECT t.token
FROM ts_parse('default', w.description) t
WHERE t.tokid = 1 -- 'asciiword'
) t ON true
GROUP BY w.id;
db<>fiddle here
Related:
- Efficient merging (removing duplicates) of arrays
- How do I select arrays that are not empty?
- What is the difference between LATERAL and a subquery in PostgreSQL?
Hi Erwin, and thanks for your usual high standard of reply! However, I played around with your solution and am puzzled by one thing - if I use the string 'under_score under-score', I get a word count of 3 and a unique word count of 3 - 2 and 2 I could understand. Fiddle here.
– Vérace
Jun 7 at 15:56
For the regular expression, an underscore (_) is a word character, but the dash (-) is not. So you get under_score, under and score - 3 "words". I also clarified ts_parse(), which knows various kinds of "words". I added some more above to clarify.
– Erwin Brandstetter
Jun 7 at 16:19
edited Jun 7 at 16:14
answered Jun 7 at 3:55
Erwin Brandstetter
98.5k9 gold badges203 silver badges325 bronze badges