What must someone know in statistics and machine learning? [closed]
$begingroup$
There seem to be two different worlds in statistics. On one hand, there are the practitioners who run the same tests again and again. On the other hand, there is this overwhelming and seemingly endless world of statistics and machine learning where one gets lost easily in specific questions, just like here on Cross Validated.
So my question is: What do you think a statistician/ML professional must know about statistics and machine learning? I know there will be comments that it depends on the area where you work. But still, there are things all statisticians (should) know, like multicollinearity, power analysis or linear regression. I really would love to have a profound foundation in statistics, but for me it is hard to tell where to go next. So if statistics and machine learning were a craft occupation, what knowledge and what tests/methods would you put in your toolbox?
The answers to my question can give many people at the beginning of their careers a feeling for what a statistician needs to know.
self-study careers
$endgroup$
closed as primarily opinion-based by Martijn Weterings, amoeba, mdewey, mkt, whuber♦ Apr 12 at 14:26
Many good questions generate some degree of opinion based on expert experience, but answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise. If this question can be reworded to fit the rules in the help center, please edit the question.
locked by whuber♦ Apr 12 at 14:24
This post has been locked while disputes about its content are being resolved. For more info visit meta.
7
$begingroup$
I am voting to reopen this question and convert it to a wiki.
$endgroup$
– Ferdi
Apr 11 at 7:43
3
$begingroup$
@igoR87 if you want to open a discussion about the closure of this question perhaps the CV Meta site is the better place?
$endgroup$
– mdewey
Apr 11 at 8:42
3
$begingroup$
You are referring to the 6th question, asked when stats.stackexchange was still in its infancy. The standards have changed a lot over time. StackExchange is a Q&A website, not a discussion website. This means that questions should be clear enough to be able to see how and why a certain answer is acceptable. Sure, there may be more and less useful answers, for instance, because of differences in elegance or detail. In those aspects, answers may be rated in a subjective way. But that does not mean that a question can be so broad that it will be unclear whether an answer is correct or not.
$endgroup$
– Martijn Weterings
Apr 11 at 13:22
3
$begingroup$
@Ferdi Community Wiki is not a solution to off-topic questions. We've used it that way in the past, but it's discouraged. meta.stackexchange.com/questions/258006/…
$endgroup$
– Sycorax
Apr 11 at 13:36
4
$begingroup$
Note that the proper status of this thread is being discussed on Cross Validated Meta: This opinion-based question must be closed. A strict reading of the standards for SE threads would unambiguously lead to this Q being closed. However, in the past we have sometimes made exceptions for threads that seemed to have a lot of value that would otherwise be lost (& made those threads CW). It remains to be seen whether this thread should be judged to fall in that 'gray area' category or not.
$endgroup$
– gung♦
Apr 11 at 13:58
edited Apr 12 at 14:24
community wiki
13 revs, 4 users 75%
igoR87
4 Answers
$begingroup$
The two worlds that you describe aren't really two different kinds of statistician, but rather:
- "statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.
- statistics proper, as understood by mathematicians, statisticians, data scientists, etc.
The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of undergraduate for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet, most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug. Or today, more commonly a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and follow a step-by-step procedure, they will be able to accomplish what they need to.
The problem is that without a fairly in-depth understanding, they:
- are very likely to misuse statistics
- can't stray from the garden path
Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.
The following poem comes to mind:
A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring:
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.
- Alexander Pope, A Little Learning
I would liken the gap between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in a WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."
What do you consider as must to know in statistics and machine learning?
The Kolmogorov axioms; the definition of a random variable (including random vectors, matrices, etc.); the algebra of random variables; the concept of a distribution; and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems. If you want to know how to prove them (optional), you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.
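As a rough illustration of the law of large numbers and Chebyshev's inequality mentioned above, here is a small simulation sketch. The choice of Uniform(0, 1) draws, the tolerance `eps`, the seed, and the helper name `sample_means` are all assumptions made for this example only.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Return `trials` sample means, each computed from n Uniform(0, 1) draws."""
    return [statistics.fmean([rng.random() for _ in range(n)]) for _ in range(trials)]

rng = random.Random(0)   # fixed seed so the run is repeatable
mu, var = 0.5, 1 / 12    # exact mean and variance of Uniform(0, 1)
eps = 0.05               # illustrative tolerance around the true mean

results = []
for n in (10, 100, 1000):
    means = sample_means(n, trials=2000, rng=rng)
    spread = statistics.pstdev(means)  # shrinks roughly like 1 / sqrt(n)
    # Fraction of simulated means that miss mu by at least eps.
    observed = sum(abs(m - mu) >= eps for m in means) / len(means)
    # Chebyshev: P(|mean - mu| >= eps) <= var / (n * eps^2), capped at 1.
    bound = min(1.0, var / (n * eps ** 2))
    results.append((n, spread, observed, bound))
    print(n, round(spread, 4), round(observed, 4), round(bound, 4))
```

The spread of the sample mean should fall as `n` grows, and the observed miss rate should sit below the Chebyshev bound, which is usually quite loose.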
This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.
There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!
At this point, you're ready to learn the basic structure of a hypothesis test; which is to say, what a "sample" is, and about null hypothesis and critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
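To make the pieces of a hypothesis test concrete, here is a minimal sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known, which is rarely true in practice; the function name `z_test` and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0, assuming a known sigma."""
    n = len(sample)
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    # Critical value: |z| beyond this has probability alpha under H0.
    critical = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical

# Hypothetical measurements; H0 says the true mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.4, 5.2]
z, critical, p_value, reject = z_test(sample, mu0=5.0, sigma=0.3)
print(f"z = {z:.3f}, critical = {critical:.3f}, p = {p_value:.5f}, reject H0: {reject}")
```

The same skeleton (statistic, null distribution, critical value or p-value) carries over to the t, chi-squared and F tests once the null distribution is swapped.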
But you're not really done; in fact, we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE. For me personally, this is the only reason why we developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.
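Here is a small sketch of "fitting as optimization" for about the simplest case I can think of: the rate of an exponential distribution. The data are simulated, and the golden-section search is just one convenient optimizer chosen for this example, not what a statistics package would necessarily use. The numerical answer can be checked against the known closed-form MLE, one over the sample mean.

```python
import math
import random

def neg_log_likelihood(rate, data):
    """Negative log-likelihood of i.i.d. Exponential(rate) observations."""
    return -(len(data) * math.log(rate) - rate * sum(data))

def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(500)]  # simulated sample, true rate 2.0

rate_numeric = golden_section_minimize(lambda r: neg_log_likelihood(r, data), 1e-6, 50.0)
rate_closed_form = len(data) / sum(data)  # analytic MLE: 1 / sample mean
print(f"numeric: {rate_numeric:.4f}, closed form: {rate_closed_form:.4f}")
```

For linear or logistic regression the idea is the same; only the likelihood has more parameters and the optimizer is more sophisticated.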
If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
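As a taste of what a power analysis looks like, here is a sketch of the usual normal-approximation sample size formula for comparing two group means, n = 2 ((z_{1-alpha/2} + z_{power}) / d)^2 per group, where d is the standardized effect size. The function name and defaults are my own choices for illustration; an exact calculation based on the t distribution gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Uses the normal approximation; effect_size is the difference in means
    divided by the common standard deviation (Cohen's d).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # controls the type I error rate
    z_beta = std_normal.inv_cdf(power)           # controls the type II error rate
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, sample_size_per_group(d))
```

Note how quickly the required sample size grows as the effect you want to detect gets smaller.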
Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
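For a feel of what cross validation does, here is a bare-bones k-fold sketch using a one-predictor least-squares line. The simulated data, the helper names `fit_line` and `k_fold_mse`, and the choice of five folds are all assumptions for this example; in real work you would normally reach for a library such as scikit-learn or caret.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def k_fold_mse(xs, ys, k, rng):
    """Estimate out-of-sample mean squared error by k-fold cross-validation."""
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out sets
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the training part, score only on the held-out part.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(fmean([(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]))
    return fmean(fold_errors)

rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]  # true noise variance is 1
cv_mse = k_fold_mse(xs, ys, k=5, rng=rng)
print(f"5-fold CV estimate of MSE: {cv_mse:.3f}")
```

Because the simulated noise has variance 1, the estimate should land somewhere near 1.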
At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.
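To show what "adding a new term to a loss function" means in the smallest possible case, here is a sketch with one predictor and no intercept. Plain least squares minimizes sum((y - b*x)^2); adding the penalty term `penalty * b**2` gives ridge regression, whose minimizer has the closed form below. The function name and the toy numbers are made up for illustration.

```python
def ridge_slope(xs, ys, penalty):
    """Slope minimizing sum((y - b*x)**2) + penalty * b**2 (no intercept).

    Setting the derivative with respect to b to zero gives
    b = sum(x*y) / (sum(x*x) + penalty); penalty = 0 is ordinary least squares.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
for penalty in (0.0, 1.0, 10.0, 100.0):
    # Larger penalties shrink the fitted slope toward zero.
    print(penalty, round(ridge_slope(xs, ys, penalty), 4))
```

Swap the squared penalty for an absolute-value penalty and you have the Lasso; mix the two and you have ElasticNet.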
What tests/ methods would you put in your toolbox?
All right, laundry list time! A lot of these come from The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to include most of the most mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:
- Ridge, Lasso, and ElasticNet Regression
- Local Regression (LOESS)
- Kernel Density Estimates
- PCA
- Factor Analysis
- K-means
- GMM (and other mixture models)
- Decision Trees, Random Forest, and XGBoost
- Time Series Analysis: ARIMA, possibly exponential smoothing
- SVM (Support Vector Machines)
- Hidden Markov Models
- GAM (Generalized Additive Models)
- Bayes Networks and Structural Equation Modeling
- Robust Regression
- Imputation
- Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.
- Bayesian Inference with MCMC a la Stan
- Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)
- Extreme value theory
- Vapnik–Chervonenkis theory
- Causality
- Pairwise/Preference modeling, e.g. Bradley-Terry
- IRT (item response theory, used for surveys and tests)
- Martingales
- Copulas
This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right in to neural nets and ignore the rest. Some people (actuaries) spend their entire career focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.
$endgroup$
2
$begingroup$
Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap.
$endgroup$
– Frank Harrell
Apr 11 at 12:58
1
$begingroup$
This post is a nice example of what a machine learning practitioner/data scientist thinks statistics is, and if one knows all the above well one can do some serious modelling. However, it is very far from what a statistician would say a statistician should know, or what statistics comprises (as in what is published in top journals).
$endgroup$
– Forgottenscience
Apr 12 at 8:38
$begingroup$
@Forgottenscience yes, that's a fair assessment. The original question was asking about "statistics and machine learning" for "statistician/ ML professionals", after all. What are some of the top-of-mind subjects for research statisticians in 2019?
$endgroup$
– olooney
Apr 12 at 17:00
$begingroup$
Speaking from a professional perspective (not an academic one), and based on having interviewed several candidates and having been interviewed myself many times as well, I would argue that deep or wide knowledge in stats is not considered a "must know", but having a very solid grasp of the basics (linear regression, hypothesis testing, probability 101, etc.) is essential, as well as some basic knowledge of algorithms (merging/joining tables, dynamic programming, search methods, etc.). I would rather have someone who understands very well how to apply Bayes’ rule and who knows how to unit test a Python function than someone who can give me a fancy explanation of how Bayesian optimization works and has experience with TensorFlow, but doesn't seem to grasp the concept of conditional probability or how to sort an array.
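To give a sense of the level I mean, here is a sketch of Bayes' rule for a yes/no hypothesis together with two plain test functions. The function name `posterior`, its parameter names, and the diagnostic-test numbers are made up for this example; the tests are written so that a runner such as pytest could collect them.

```python
import math

def posterior(prior, likelihood, false_positive_rate):
    """P(H | positive result) via Bayes' rule for a binary hypothesis H.

    prior               = P(H)
    likelihood          = P(positive | H)
    false_positive_rate = P(positive | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

def test_rare_condition_stays_unlikely():
    # 1% prevalence, 95% sensitivity, 5% false positives:
    # a positive result still leaves the condition fairly unlikely.
    assert math.isclose(posterior(0.01, 0.95, 0.05), 0.0095 / 0.059)

def test_uninformative_result_returns_prior():
    # If the result is equally likely either way, the prior is unchanged.
    assert math.isclose(posterior(0.3, 0.5, 0.5), 0.3)
```

A candidate who can write this, and explain why the first posterior is only about 16%, has the basics I am looking for.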
Beyond the basics, most good companies or teams will quiz you on what you claim you know, not what they think you should know. If you put SVM on your resume, make sure you truly understand SVM, and have some experience using it.
Also, good companies or teams will test your hands-on experience more than the depth of your theoretical knowledge.
$endgroup$
1
$begingroup$
It is incredibly unlikely that someone could explain in a fancy manner how Bayesian optimization works yet doesn't understand conditional probability or sorting an array. The theoretical knowledge helps guide a lot of applied problems where only applied knowledge keeps blinders on the horse and can limit quality of the work.
$endgroup$
– LSC
Apr 11 at 12:44
$begingroup$
@LSC you'd be surprised. I have run into a lot of candidates who fit exactly that description. I think it's one of the main reasons more and more companies put people through a hands-on technical screening as the first step in their interview process.
$endgroup$
– Skander H.
Apr 11 at 14:19
2
$begingroup$
I would be surprised. Although, I have met many "data scientists", "ML", and "big data"/"six sigma" people who use the big words/fancy methodology names but have no real statistical background and therefore don't understand the words they use or the big picture (like people using certain kinds of mixed models or LASSO but mistakenly believe a p-value is an error probability). I agree with exploring the depth of someone's claimed knowledge and experience.
$endgroup$
– LSC
Apr 11 at 14:44
$begingroup$
What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).
What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously
- Say that statistics is hard
- Admit that they have little training or expertise in it and
- Do it on their own anyway.
No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.
What I need to know is
When I am out of my depth. No one knows all this stuff, certainly I don't.
A whole lot about models, methods and such, when each can be applied, what each does, how it goes wrong, alternatives etc.
Also, how to run these models in some statistical package and read the results, detect bugs etc. (I use SAS and R, but other choices are fine).
How to ask questions. A good data analyst asks a lot of questions.
Enough matrix algebra and calculus to at least read articles. But that's not all that much.
Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression; trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, a la Dilbert, and could be a committee chair, a journal editor, a colleague or an actual boss.)
$endgroup$
$begingroup$
For a person doing work in statistics or doing work associated with statistics there is not really much clear must-know knowledge.
Obviously, people should be able to do simple and ordinary things, e.g. simple arithmetic.
But beyond that, statistics and machine learning is enormously broad and multidisciplinary. You might have a person who only writes SQL and manages databases, or a person collecting data for the state, e.g. something like Eurostat (statistics is etymologically derived from 'state'). Should those people know the Kolmogorov axioms, or should they know all types of Pearson distributions?
It is a bit like asking what tools a construction worker must be able to work with. An electrician is not like a carpenter, and a plumber is not like a plasterer. There is very little that they all must know, and it will only be a fraction of their abilities.
$endgroup$
5
$begingroup$
@igoR87 I have correctly answered your question. This answer is "what I consider a person must know". If you believe that this is a useless answer, then it is at least useful in showing how incomplete and useless the question is.
$endgroup$
– Martijn Weterings
Apr 12 at 8:46
3
$begingroup$
@igoR87 this is my serious answer. I did not post it at first because I find it not very useful. So, yes, you are right that my motivation to post this answer is different. But it is what would be my answer. The answer is serious, but posting the answer is not.
$endgroup$
– Martijn Weterings
Apr 12 at 8:57
2
$begingroup$
It is because of these open possibilities to view your question that I find the question too vague. I can not improve your question because I can not know what you wanted it to look like more specifically. If this answer is not what you are looking for then this helps you to understand what is wrong with your question, and you can improve to make it more like your intentions (I do not know your intentions, I only know the question is vague).
$endgroup$
– Martijn Weterings
Apr 12 at 8:58
1
$begingroup$
Posting this answer is how I try to work towards a better question. Like Peter Flom says, statisticians need to be good at asking and enquiring. Understand the problem/question the client has; don't just start editing it yourself based on (possibly false) assumptions.
$endgroup$
– Martijn Weterings
Apr 12 at 9:02
1
$begingroup$
@igoR87, while my intention in answering may be weird, this is my true answer and it does convey "what I consider must-know knowledge" (I believe there is none besides very simple basics like arithmetic). Besides that, indicating how a question is wrongly phrased or can be interpreted is not an uncommon answering style on Stack Exchange.
$endgroup$
– Martijn Weterings
Apr 12 at 9:04
$begingroup$
The two worlds that you describe aren't really two different kinds of statistician, but rather:
- "statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.
- statistics proper, as understood by mathematicians, statisticians, data scientists, etc.
The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of undergraduate for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet, most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug. Or today, more commonly a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and follow a step-by-step procedure, they will be able to accomplish what they need to.
The problem is that without a fairly in-depth understanding, they:
- are very likely to misuse statistics
- can't stray from the garden path
Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.
The following poem comes to mind:
A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring:
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.
- Alexander Pope, A Little Learning
I would liken the difference between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in a WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."
What do you consider as must to know in statistics and machine learning?
The Kolmogorov axioms, the definition of a random variable (including random vectors, matrices, etc.), the algebra of random variables, the concept of a distribution, and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems. If you want to know how to prove them (optional), you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.
This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.
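To make the limit theorems mentioned above a little more concrete, here is a minimal simulation sketch using only the Python standard library. The choice of a Uniform(0, 1) population, the seed, and the sample sizes are my own assumptions for illustration; nothing here comes from the textbook.

```python
import random
import statistics

def sample_mean(rng, n):
    """Mean of n independent Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
mu = 0.5                # true mean of Uniform(0, 1)
var = 1.0 / 12.0        # true variance of Uniform(0, 1)

# Law of large numbers: the sample mean settles near mu as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(rng, n), 4))

# Chebyshev's inequality: P(|mean - mu| >= eps) <= var / (n * eps**2).
n, eps, trials = 50, 0.1, 2_000
means = [sample_mean(rng, n) for _ in range(trials)]
observed = sum(abs(m - mu) >= eps for m in means) / trials
bound = var / (n * eps ** 2)
print("observed tail frequency:", observed, "Chebyshev bound:", round(bound, 4))

# Central limit theorem: the spread of the sample means is close to sqrt(var / n).
print("sd of means:", round(statistics.stdev(means), 4),
      "theory:", round((var / n) ** 0.5, 4))
```

The Chebyshev bound is typically very loose compared with the observed tail frequency, which is part of why the sharper normal approximation from the central limit theorem is used so often in practice.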
There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!
At this point, you're ready to learn the basic structure of a hypothesis test; which is to say, what a "sample" is, and about null hypothesis and critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
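As one small example of deriving a test directly from a distribution, here is a sketch of an exact one-sided binomial test. The function name `binomial_upper_tail` and the sample (9 successes in 10 trials) are hypothetical choices of mine for illustration.

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value under the null."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical sample: 9 successes in 10 trials; null hypothesis p = 0.5.
p_value = binomial_upper_tail(9, 10)
print(p_value)  # 11/1024 = 0.0107421875

# Critical value: the smallest k whose upper-tail probability is at most alpha.
alpha = 0.05
critical = next(k for k in range(11) if binomial_upper_tail(k, 10) <= alpha)
print(critical)  # 9, because P(X >= 8) = 56/1024 is still above 0.05
```

Note that the p-value here is a tail probability computed under the assumption that the null hypothesis is true; it is not the probability that the null hypothesis is true.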
But you're not really done; in fact, we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE (maximum likelihood estimation). For me personally, this is the only reason why I developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.
If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
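To illustrate the "fitting is optimization" idea, here is a minimal sketch that finds the maximum likelihood estimate of an exponential rate by numerically minimizing the negative log-likelihood, then compares it with the known closed-form answer. The simulated data, the seed, and the use of golden-section search are my own assumptions; real work would normally use an optimizer from a numerical library.

```python
import math
import random

def neg_log_likelihood(rate, data):
    """Negative log-likelihood of i.i.d. Exponential(rate) observations."""
    return -(len(data) * math.log(rate) - rate * sum(data))

def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2

rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(5_000)]  # simulated sample, true rate 2.0

rate_hat = golden_section_minimize(lambda r: neg_log_likelihood(r, data), 1e-6, 50.0)
closed_form = len(data) / sum(data)  # analytic MLE for the exponential rate
print(rate_hat, closed_form)
```

The two numbers should agree to several decimal places. For models like logistic regression there is no closed form, so the numerical route is the only one available.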
Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
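For readers who have not seen cross validation before, here is a bare-bones sketch of k-fold cross validation used for model selection, written with the standard library only. The two candidate models, the simulated data, and the helper names (`fit_mean`, `fit_line`, `k_fold_mse`) are assumptions of mine for illustration; in practice you would use the tools in scikit-learn or caret.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple least-squares line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def k_fold_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point is held out
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        fold_errors.append(sum((model(x) - y) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k

rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]  # simulated data

print("mean-only model:", round(k_fold_mse(fit_mean, xs, ys), 3))
print("linear model:   ", round(k_fold_mse(fit_line, xs, ys), 3))
```

The model with the lower held-out error is the one cross validation would select; here that should be the linear model, since the data were simulated from a line.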
At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.
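The remark about adding a term to a loss function can be sketched in a few lines. Below, a one-parameter least-squares fit (slope only, no intercept) turns into ridge regression simply by adding a squared penalty on the slope. The function name `fit_slope` and the toy data are made up for illustration.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Minimize sum((y - b*x)^2) + penalty * b^2 over the single slope b.

    Setting the derivative to zero gives b = sum(x*y) / (sum(x*x) + penalty).
    penalty = 0 is ordinary least squares; penalty > 0 is ridge regression.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x

ols = fit_slope(xs, ys)
ridge = fit_slope(xs, ys, penalty=10.0)
print(ols, ridge)  # the ridge slope is shrunk toward zero relative to OLS
```

Swapping the squared penalty for an absolute-value penalty would give the lasso instead, which is the sense in which each new term defines a new model.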
What tests/ methods would you put in your toolbox?
All right, laundry list time! A lot of these come from The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to cover most of the mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:
- Ridge, Lasso, and ElasticNet Regression
- Local Regression (LOESS)
- Kernel Density Estimates
- PCA
- Factor Analysis
- K-means
- GMM (and other mixture models)
- Decision Trees, Random Forest, and XGBoost
- Time Series Analysis: ARIMA, possibly exponential smoothing
- SVM (Support Vector Machines)
- Hidden Markov Models
- GAM (Generalized Additive Models)
- Bayes Networks and Structural Equation Modeling
- Robust Regression
- Imputation
- Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.
- Bayesian Inference with MCMC a la Stan
- Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)
- Extreme value theory
- Vapnik–Chervonenkis theory
- Causality
- Pairwise/Preference modeling, e.g. Bradley-Terry
- IRT (item response theory, used for surveys and tests)
- Martingales
- Copulas
This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right into neural nets and ignore the rest. Some people (actuaries) spend their entire career focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.
$endgroup$
2
$begingroup$
Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap.
$endgroup$
– Frank Harrell
Apr 11 at 12:58
1
$begingroup$
This post is a nice example of what a machine learning practitioner/data scientist thinks statistics is, and if one knows all the above well one can do some serious modelling. However, it is very far from what a statistician would say a statistician should know, or what statistics comprises (as in what is published in top journals).
$endgroup$
– Forgottenscience
Apr 12 at 8:38
$begingroup$
@Forgottenscience yes, that's a fair assessment. The original question was asking about "statistics and machine learning" for "statistician/ ML professionals", after all. What are some of the top-of-mind subjects for research statisticians in 2019?
$endgroup$
– olooney
Apr 12 at 17:00
The two worlds that you describe aren't really two different kinds of statistician, but rather:
- "statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.
- statistics proper, as understood by mathematicians, statisticians, data scientists, etc.
The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of undergraduate for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet, most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug. Or today, more commonly a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and follow a step-by-step procedure, they will be able to accomplish what they need to.
The problem is that without a fairly in-depth understanding, they:
- are very likely to misuse statistics
- can't stray from the garden path
Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.
The following poem comes to mind:
A little learning is a dangerous thing;
Drink deep, or taste not the Pierian spring:
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.
- Alexander Pope, A Little Learning
I would liken the gap between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in a WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."
What do you consider as must to know in statistics and machine learning?
The Kolmogorov axioms, the definition of a random variable (including random vectors, matrices, etc.), the algebra of random variables, the concept of a distribution, and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems, although if you want to know how to prove them (optional) you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.
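To make two of those theorems concrete, here is a minimal simulation sketch using only the Python standard library. The Exponential(1) distribution, the sample size, and the helper name `chebyshev_check` are all my own choices for illustration, not anything prescribed above. It shows the running mean settling near the true mean (law of large numbers) and compares an observed tail fraction against the Chebyshev bound.

```python
import random
import statistics

def chebyshev_check(samples, k):
    """Return (observed tail fraction, Chebyshev bound 1/k^2).

    The tail is the share of points at least k standard deviations
    away from the mean, using the sample's own mean and std. dev.
    """
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
    return tail, 1.0 / k**2

rng = random.Random(0)  # fixed seed so the run is repeatable

# Exponential(1) draws: true mean 1, true variance 1, clearly non-normal.
draws = [rng.expovariate(1.0) for _ in range(100_000)]

# Law of large numbers: the running mean drifts toward the true mean 1.0.
for n in (10, 1_000, 100_000):
    print(n, statistics.fmean(draws[:n]))

# Chebyshev: no more than 1/k^2 of the mass lies k or more std. devs. out.
tail, bound = chebyshev_check(draws, k=2)
print("observed tail:", tail, "Chebyshev bound:", bound)
```

The bound is usually very loose, which is the point: it holds for any distribution with finite variance, so it cannot be tight for all of them.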
This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.
There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!
At this point, you're ready to learn the basic structure of a hypothesis test, which is to say, what a "sample" is, and about null hypotheses, critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
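As one small illustration of that structure, here is a sketch of a two-sided, two-proportion z-test of the kind used for a simple A/B test. The function name and the conversion counts are made up for the example, and the normal approximation is assumed to be adequate for samples of this size.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test of H0: both groups share one success rate.

    Returns (z statistic, p-value) under the normal approximation.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (success_a + success_b) / (n_a + n_b)
    # Standard error of the difference in sample proportions under H0.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical A/B test: 200 of 2000 convert on A, 250 of 2000 on B.
z, p = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Every piece maps onto the vocabulary above: the pooled rate encodes the null hypothesis, `z` is the test statistic, and comparing `p` to a chosen significance level is equivalent to comparing `z` to a critical value.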
But you're not really done; in fact we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE (maximum likelihood estimation). For me personally, this is the only reason why we developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.
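To show what "pose it as an optimization problem" can look like, here is a rough sketch that fits a one-feature logistic regression by maximizing the Bernoulli log-likelihood with plain gradient ascent. The simulated data, learning rate, step count, and function names are assumptions picked for the illustration; real software typically uses faster methods such as iteratively reweighted least squares.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the model P(y=1 | x) = sigmoid(b0 + b1*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic_mle(xs, ys, lr=0.5, steps=1000):
    """Gradient ascent on the mean log-likelihood, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - sigmoid(b0 + b1 * x)  # gradient contribution of one point
            g0 += resid
            g1 += resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Simulate data from known parameters so the fit can be sanity-checked.
rng = random.Random(1)
true_b0, true_b1 = -0.5, 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ys = [1 if rng.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic_mle(xs, ys)
print("estimates:", b0_hat, b1_hat)
print("log-likelihood at estimates:", log_likelihood(b0_hat, b1_hat, xs, ys))
print("log-likelihood at true values:", log_likelihood(true_b0, true_b1, xs, ys))
```

Note that the fitted parameters score at least as well on the sample as the true ones do. That is what "maximum likelihood" means, and it is also a first hint of why validation on held-out data matters.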
If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
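For readers who have not met cross validation before, the idea fits in a few lines. This is a bare-bones sketch in plain Python with a simple straight-line model and simulated data; the helper names are mine, and in practice you would more likely reach for a library such as scikit-learn or caret.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_val_mse(xs, ys, k=5, seed=0):
    """Mean squared error on each held-out fold, fitting on the rest."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores

# Simulated data: y = 1 + 2x plus noise with standard deviation 0.5.
rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

scores = cross_val_mse(xs, ys, k=5)
print("per-fold MSE:", scores)
print("mean CV MSE:", statistics.fmean(scores))
```

The key property is that every observation is scored by a model that never saw it during fitting, so the average is an honest estimate of out-of-sample error (here it should land near the noise variance of 0.25).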
At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.
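A tiny example of "a new term in the loss function": ridge regression is least squares plus a penalty on the size of the coefficient. The sketch below uses a single slope with no intercept so that both minimizers have a one-line closed form; the four data points and the penalty weight are invented for illustration.

```python
def ols_slope(xs, ys):
    """Minimizer of sum((y - b*x)^2) over b (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 over b."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def penalized_loss(b, xs, ys, lam):
    """Squared-error loss plus the ridge penalty term."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) + lam * b * b

# Made-up data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

b_ols = ols_slope(xs, ys)
b_ridge = ridge_slope(xs, ys, lam=10.0)
print("OLS slope:", b_ols)      # roughly 1.99
print("ridge slope:", b_ridge)  # roughly 1.49, shrunk toward zero
```

Swap the squared penalty for an absolute value and you have the Lasso; use both and you have ElasticNet. The optimization gets harder, but the recipe is the same.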
What tests/ methods would you put in your toolbox?
All right, laundry list time! A lot of these come from The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to implement most of the most mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:
- Ridge, Lasso, and ElasticNet Regression
- Local Regression (LOESS)
- Kernel Density Estimates
- PCA
- Factor Analysis
- K-means
- GMM (and other mixture models)
- Decision Trees, Random Forest, and XGBoost
- Time Series Analysis: ARIMA, possibly exponential smoothing
- SVM (Support Vector Machines)
- Hidden Markov Models
- GAM (Generalized Additive Models)
- Bayes Networks and Structural Equation Modeling
- Robust Regression
- Imputation
- Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.
- Bayesian Inference with MCMC a la Stan
- Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)
- Extreme value theory
- Vapnik–Chervonenkis theory
- Causality
- Pairwise/Preference modeling, e.g. Bradley-Terry
- IRT (item response theory, used for surveys and tests)
- Martingales
- Copulas
This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right into neural nets and ignore the rest. Some people (actuaries) spend their entire career focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.
edited Apr 12 at 17:16
community wiki
2 revs
olooney
2
Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap.
– Frank Harrell
Apr 11 at 12:58
1
This post is a nice example of what a machine learning practitioner/data scientist thinks statistics is, and if one knows all the above well one can do some serious modelling. However, it is very far from what a statistician would say a statistician should know, or what statistics comprises (as in what is published in top journals).
– Forgottenscience
Apr 12 at 8:38
@Forgottenscience yes, that's a fair assessment. The original question was asking about "statistics and machine learning" for "statistician/ ML professionals", after all. What are some of the top-of-mind subjects for research statisticians in 2019?
– olooney
Apr 12 at 17:00
Speaking from a professional perspective (not an academic one), and based on having interviewed several candidates and having been interviewed myself many times as well, I would argue that deep or wide knowledge in stats is not considered a "must know", but having a very solid grasp of the basics (linear regression, hypothesis testing, probability 101, etc.) is essential, as is some basic knowledge of algorithms (merging/joining tables, dynamic programming, search methods, etc.). I would rather have someone who understands very well how to apply Bayes' rule and who knows how to unit test a Python function than someone who can give me a fancy explanation of how Bayesian optimization works and has experience with TensorFlow, but doesn't seem to grasp the concept of conditional probability or how to sort an array.
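For a sense of the level being described, here is a sketch of what "apply Bayes' rule and unit test a Python function" might amount to. The function name, its argument names, and the screening-test numbers are invented for the example, and the tests are plain `assert`-based functions of the kind a runner such as pytest would collect.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) by Bayes' rule.

    prior               -- P(H)
    likelihood          -- P(E | H)
    false_positive_rate -- P(E | not H)
    """
    for p in (prior, likelihood, false_positive_rate):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Law of total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H).
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has probability zero; posterior is undefined")
    return likelihood * prior / evidence

def test_rare_condition():
    # Hypothetical screening test: 1% prevalence, 90% sensitivity,
    # 5% false positive rate. A positive result is still mostly wrong.
    assert abs(posterior(0.01, 0.90, 0.05) - 0.009 / 0.0585) < 1e-12

def test_uninformative_evidence_leaves_prior_unchanged():
    # If E is equally likely under H and not H, it tells us nothing.
    assert abs(posterior(0.3, 0.4, 0.4) - 0.3) < 1e-12

def test_rejects_invalid_probability():
    try:
        posterior(1.5, 0.9, 0.05)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a probability above 1")
```

None of this is deep, which is rather the point: the first test encodes the classic base-rate result (a posterior of about 15% despite a "90% accurate" test), and the other two check an invariant and an error path.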
Beyond the basics, most good companies or teams will quiz you on what you claim you know, not what they think you should know. If you put SVM on your resume, make sure you truly understand SVM, and have some experience using it.
Also, good companies or teams will test your hands-on experience more than the depth of your theoretical knowledge.
edited Apr 11 at 9:38
community wiki
Skander H.
1
It is incredibly unlikely that someone could explain in a fancy manner how Bayesian optimization works yet doesn't understand conditional probability or sorting an array. The theoretical knowledge helps guide a lot of applied problems where only applied knowledge keeps blinders on the horse and can limit quality of the work.
– LSC
Apr 11 at 12:44
@LSC you'd be surprised. I have run into a lot of candidates who fit exactly that description. I think it's one of the main reasons more and more companies put people through a hands-on technical screening as the first step in their interview process.
– Skander H.
Apr 11 at 14:19
2
I would be surprised. Although, I have met many "data scientists", "ML", and "big data"/"six sigma" people who use the big words/fancy methodology names but have no real statistical background and therefore don't understand the words they use or the big picture (like people using certain kinds of mixed models or LASSO but mistakenly believe a p-value is an error probability). I agree with exploring the depth of someone's claimed knowledge and experience.
– LSC
Apr 11 at 14:44
What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).
What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously:
- Say that statistics is hard
- Admit that they have little training or expertise in it, and
- Do it on their own anyway.
No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.
What I need to know is:
- When I am out of my depth. No one knows all this stuff; certainly I don't.
- A whole lot about models, methods and such: when each can be applied, what each does, how it goes wrong, alternatives, etc.
- Also, how to run these models in some statistical package and read the results, detect bugs, etc. (I use SAS and R, but other choices are fine).
- How to ask questions. A good data analyst asks a lot of questions.
- Enough matrix algebra and calculus to at least read articles. But that's not all that much.
Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression; trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, a la Dilbert, and could be a committee chair, a journal editor, a colleague or an actual boss).
answered Apr 11 at 12:54
community wiki
Peter Flom
For a person working in statistics, or doing work associated with statistics, there is not really much that is clearly must-know knowledge.
Obviously, people should be able to do simple and ordinary things, e.g. simple arithmetic.
But beyond that, statistics and machine learning are enormously broad and multidisciplinary. You might have a person who only writes SQL and manages databases, or a person collecting data for the state, e.g. for something like Eurostat (statistics is etymologically derived from 'state'). Should those people know the Kolmogorov axioms, or should they know all types of Pearson distributions?
It is a bit like asking what tools a construction worker must be able to work with. An electrician is not like a carpenter, and a plumber is not like a plasterer. There is very little that they all must know, and it will only be a fraction of their abilities.
edited Apr 12 at 9:19
community wiki
3 revs
Martijn Weterings
5
@igoR87 I have correctly answered your question. This answer is "what I consider a person must know". If you believe that this is a useless answer, then it is at least useful in showing how incomplete and useless the question is.
– Martijn Weterings
Apr 12 at 8:46
3
@igoR87 This is my serious answer. I did not post it at first because I find it not very useful. So, yes, you are right that my motivation to post this answer is different. But it is what would be my answer. The answer is serious, but posting the answer is not.
– Martijn Weterings
Apr 12 at 8:57
2
It is because of these open possibilities to view your question that I find the question too vague. I cannot improve your question because I cannot know what, more specifically, you wanted it to look like. If this answer is not what you are looking for, then it helps you understand what is wrong with your question, and you can improve it to better match your intentions (I do not know your intentions; I only know the question is vague).
– Martijn Weterings
Apr 12 at 8:58
1
Posting this answer is how I try to work toward a better question. As Peter Flom says, statisticians need to be good at asking and enquiring: understand the problem or question the client has, and do not just start editing it yourself based on (possibly false) assumptions.
– Martijn Weterings
Apr 12 at 9:02
1
@igoR87, while my intention in answering may be weird, this is my true answer and it does convey "what I consider must-know knowledge" (I believe there is none besides very simple basics like arithmetic). Besides that, indicating how a question is wrongly phrased or can be interpreted is not an uncommon answering style on Stack Exchange.
– Martijn Weterings
Apr 12 at 9:04
7
I am voting to reopen this question and convert it to a wiki.
– Ferdi
Apr 11 at 7:43
3
@igoR87 If you want to open a discussion about the closure of this question, perhaps the CV Meta site is the better place?
– mdewey
Apr 11 at 8:42
3
You are referring to the 6th question, from when stats.stackexchange was still in its infancy. The standards have changed a lot over time. Stack Exchange is a Q&A website, not a discussion website. This means that questions should be clear enough to be able to see how and why a certain answer is acceptable. Sure, there may be more and less useful answers, for instance because of differences in elegance or detail. In those aspects, answers may be rated in a subjective way. But that does not mean that a question can be so broad that it is unclear whether an answer is correct or not.
– Martijn Weterings
Apr 11 at 13:22
3
@Ferdi Community Wiki is not a solution to off-topic questions. We've used it that way in the past, but it's discouraged. meta.stackexchange.com/questions/258006/…
– Sycorax
Apr 11 at 13:36
4
Note that the proper status of this thread is being discussed on Cross Validated Meta: "This opinion-based question must be closed". A strict reading of the standards for SE threads would unambiguously lead to this Q being closed. However, in the past we have sometimes made exceptions for threads that seemed to have a lot of value that would otherwise be lost (& made those threads CW). It remains to be seen whether this thread should be judged to fall in that 'gray area' category or not.
– gung♦
Apr 11 at 13:58