Is numpy.corrcoef() enough to find correlation?

I am currently working through Kaggle's Titanic competition and I'm trying to figure out the correlation between the Survived column and the other columns. I am using numpy.corrcoef() to compute the correlation matrix between each column and Survived, and here is what I get:

The correlation between pClass & Survived is: [[ 1. -0.33848104]
 [-0.33848104 1. ]]

The correlation between Sex & Survived is: [[ 1. -0.54335138]
 [-0.54335138 1. ]]

The correlation between Age & Survived is: [[ 1. -0.07065723]
 [-0.07065723 1. ]]

The correlation between Fare & Survived is: [[1. 0.25730652]
 [0.25730652 1. ]]

The correlation between Parent-Children & Survived is: [[1. 0.08162941]
 [0.08162941 1. ]]

The correlation between Sibling-Spouse & Survived is: [[ 1. -0.0353225]
 [-0.0353225 1. ]]

The correlation between Embarked & Survived is: [[ 1. -0.16767531]
 [-0.16767531 1. ]]

I expected a higher correlation between Survived and pClass, Sex and Sibling-Spouse, yet the values are really low. I'm new to this, so I understand that a simple method may not be the best way to find correlations, but at the moment this doesn't add up.

This is my full code (without the print() calls):

import pandas as pd
import numpy as np

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")

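# Target: 0 = did not survive, 1 = survived
survived = train['Survived']
# Ticket class: 1 = first, 2 = second, 3 = third
pClass = train['Pclass']
# Encode sex as an integer: female -> 0, male -> 1
sex = train['Sex'].replace(['female', 'male'], [0, 1])
# Fill missing ages with the rounded mean age (Series.mean() skips NaN)
age = train['Age'].fillna(round(train['Age'].mean()))
fare = train['Fare']
parch = train['Parch']  # number of parents/children aboard
sibSp = train['SibSp']  # number of siblings/spouses aboard
# Encode the port of embarkation as integers (the ordering is arbitrary)
embarked = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])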
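
For readers following along: numpy.corrcoef() returns the full 2x2 matrix, and the coefficient of interest is the off-diagonal entry. A minimal sketch of pulling out that single number, using a small made-up sample rather than train.csv (the toy arrays and the helper name pearson_r are assumptions for illustration only):

```python
import numpy as np

def pearson_r(x, y):
    """Return the Pearson coefficient between two 1-D sequences as a float."""
    # np.corrcoef returns [[1, r], [r, 1]]; the off-diagonal entry is r.
    return float(np.corrcoef(x, y)[0, 1])

# Made-up toy data, NOT taken from train.csv
sex = np.array([1, 1, 0, 0, 1, 0, 1, 0])       # female -> 0, male -> 1
survived = np.array([0, 0, 1, 1, 0, 1, 1, 0])  # 0 = no, 1 = yes

r = pearson_r(sex, survived)
print(f"Sex vs Survived: {r:.3f}")  # prints "Sex vs Survived: -0.500"
```

Printing one number per column makes the values easier to compare than a stack of 2x2 matrices.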

edited May 14 at 13:45

Juan Esteban de la Calle

1,363324

asked May 14 at 9:45

Atilla Adrianopolos

1134

Why do you think the values should be higher?
– nairboon
May 14 at 9:56

Because there is a strong correlation between sex, class and survival. Women and rich passengers were most likely to survive.
– Atilla Adrianopolos
May 14 at 9:59

machine-learning python feature-selection numpy kaggle

2 Answers

As a side note, I don't think correlation is the right measure of association to use here, since Survived is technically a binary categorical variable.

The "correlation" measure you use should depend on the types of the variables being compared:

continuous variable vs. continuous variable: use "traditional" correlation, e.g. Spearman's rank correlation or Pearson's linear correlation.

continuous variable vs. categorical variable: use an ANOVA F-test or a difference of means.

categorical variable vs. categorical variable: use a Chi-square test of independence or Cramér's V.
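
The last two cases can be sketched with NumPy alone. This is an illustrative sketch on made-up data, not the Titanic file: the helper names cramers_v and anova_f and the toy arrays are assumptions, the chi-square used is the plain Pearson statistic without a continuity correction, and each variable is assumed to have at least two categories.

```python
import numpy as np

def cramers_v(a, b):
    """Cramér's V between two categorical sequences (0 = none, 1 = perfect)."""
    a_codes = np.unique(a, return_inverse=True)[1]
    b_codes = np.unique(b, return_inverse=True)[1]
    # Contingency table of observed counts
    observed = np.zeros((a_codes.max() + 1, b_codes.max() + 1))
    np.add.at(observed, (a_codes, b_codes), 1)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))

def anova_f(values, groups):
    """One-way ANOVA F statistic for a continuous variable split by category."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    # Between-group and within-group sums of squares
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in labels
    )
    ss_within = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in labels
    )
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return float((ss_between / df_between) / (ss_within / df_within))

# Made-up toy data, NOT taken from train.csv
sex = ['male', 'male', 'female', 'female', 'male', 'female', 'male', 'female']
survived = [0, 0, 1, 1, 0, 1, 1, 0]
fare = [7, 8, 30, 31, 9, 32, 29, 10]

v = cramers_v(sex, survived)   # categorical vs. categorical
f = anova_f(fare, survived)    # continuous vs. categorical
print(f"Cramér's V (sex, survived): {v:.3f}")
print(f"ANOVA F (fare by survived): {f:.1f}")
```

For p-values, scipy.stats.chi2_contingency and scipy.stats.f_oneway cover the same two tests; note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected one computed here.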

answered May 14 at 11:07

bradS

783214

Here is a closely related old post.
– Esmailian
May 18 at 15:29

@bradS When you say ANOVA F-test/difference of means, do you mean dividing the ANOVA F-test by the difference of means?
– Atilla Adrianopolos
May 19 at 17:50

@AtillaAdrianopolos, no, I mean "/" as "or". Using item 3 above as an example, use the Chi-square test of independence or Cramér's V.
– bradS
May 20 at 8:09

You probably encoded women as 0 and men as 1, which is why you get a negative correlation of -0.54: Survived is 0 for No and 1 for Yes. Your calculation actually shows what you expected. The sign only reflects the direction, which depends on your encoding; the strength of the relationship between being a woman and surviving is 0.54.

Similarly, pClass is negatively correlated (-0.33) because the highest class (1st class) is encoded as 1 and the lowest as 3, so the direction is negative.

You could make the relations more intuitive by creating separate columns for men and women, each holding 0 or 1 depending on the sex; the correlations will then have the intuitive direction (sign). The same holds for pClass.
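
A small sketch of that suggestion, on made-up data rather than train.csv (the toy arrays and the names is_male, is_female and class_rank are assumptions for illustration): flipping the encoding flips the sign of the coefficient and leaves its magnitude unchanged.

```python
import numpy as np

# Made-up toy data, NOT taken from train.csv
is_male = np.array([1, 1, 0, 0, 1, 0, 1, 0])
pclass = np.array([3, 3, 1, 1, 3, 2, 1, 2])    # 1 = first class, 3 = third
survived = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# Separate indicator column for women: 1 where is_male is 0
is_female = 1 - is_male
# Reversed class code so that a larger number means a higher class
class_rank = 4 - pclass

r_male = np.corrcoef(is_male, survived)[0, 1]
r_female = np.corrcoef(is_female, survived)[0, 1]
r_pclass = np.corrcoef(pclass, survived)[0, 1]
r_rank = np.corrcoef(class_rank, survived)[0, 1]

print(f"male vs survived:       {r_male:+.2f}")
print(f"female vs survived:     {r_female:+.2f}")
print(f"pclass vs survived:     {r_pclass:+.2f}")
print(f"class rank vs survived: {r_rank:+.2f}")
```

Each pair of lines shows the same magnitude with opposite signs, so the negative values in the question carry the same information as their positive counterparts.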

edited May 14 at 13:16

Stephen Rauch♦

1,51361330

answered May 14 at 10:10

nairboon

1132

I've added my code.
– Atilla Adrianopolos
May 14 at 10:14

What if I encode male/female as 3/4 instead? They're still binary values, and that just might solve the problem you're raising.
– Atilla Adrianopolos
May 14 at 10:15
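
On the 3/4 question: Pearson's coefficient does not change when a constant is added to a variable or when it is multiplied by a positive constant, so recoding 0/1 as 3/4 leaves the value as it is. Only swapping which sex gets the larger code flips the sign. A quick check on made-up data (the arrays below are assumptions for illustration, not the Titanic file):

```python
import numpy as np

# Made-up toy data, NOT taken from train.csv
survived = np.array([0, 0, 1, 1, 0, 1, 1, 0])
sex_01 = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # female -> 0, male -> 1
sex_34 = sex_01 + 3                          # female -> 3, male -> 4
sex_43 = 4 - sex_01                          # female -> 4, male -> 3

r_01 = np.corrcoef(sex_01, survived)[0, 1]
r_34 = np.corrcoef(sex_34, survived)[0, 1]
r_43 = np.corrcoef(sex_43, survived)[0, 1]

print(f"0/1 encoding:     {r_01:+.2f}")
print(f"3/4 encoding:     {r_34:+.2f}")
print(f"4/3 (order swap): {r_43:+.2f}")
```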
