Otdfbt

Question

There seem to be two different worlds in statistics. On one hand, there are the practitioners who run the same tests again and again. On the other hand, there is this overwhelming and seemingly endless world of statistics and machine learning where one gets lost easily in specific questions - just like here on Cross Validated.

So my question is: What do you consider a statistician/ML professional must know about statistics and machine learning? I know there will be comments that it depends on the area where you work. But still, there are things all statisticians (should) know, like multicollinearity, power analysis or linear regression. I really would love to have a solid foundation in statistics, but for me it is hard to tell where to go next. So if statistics and machine learning were a craft occupation, what knowledge and what tests/methods would you put in your toolbox?

The answers to my question can give many people at the beginning of their careers a feeling for what a statistician needs to know.

I am voting to reopen this question and convert it to a wiki. — Apr 11 at 7:43
@igoR87 if you want to open a discussion about the closure of this question perhaps the CV Meta site is the better place? — Apr 11 at 8:42
You are referring to the 6th question, from when stats.stackexchange was still in its infancy. The standards have changed a lot over time. StackExchange is a Q&A website, not a discussion website. This means that questions should be clear enough to be able to see how and why a certain answer is acceptable. Sure, there may be more and less useful answers, for instance, because of differences in elegance or detail. In those aspects, answers may be rated in a subjective way. But that does not mean that a question can be so broad that it will be unclear whether an answer is correct or not. — Apr 11 at 13:22
@Ferdi Community Wiki is not a solution to off-topic questions. We've used it that way in the past, but it's discouraged. meta.stackexchange.com/questions/258006/… — Apr 11 at 13:36
Note that the proper status of this thread is being discussed on Cross Validated Meta: This opinion-based question must be closed. A strict reading of the standards for SE threads would unambiguously lead to this Q being closed. However, in the past we have sometimes made exceptions for threads that seemed to have a lot of value that would otherwise be lost (& made those threads CW). It remains to be seen whether this thread should be judged to fall in that 'gray area' category or not. — Apr 11 at 13:58

score 14 · Accepted Answer · 2019-04-12 17:16:29Z

The two worlds that you describe aren't really two different kinds of statistician, but rather:

"statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.

statistics proper, as understood by mathematicians, statisticians, data scientists, etc.

The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high-dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of undergraduate study for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug or, more commonly today, hand them a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and a step-by-step procedure, they will be able to accomplish what they need to.

The problem is that without a fairly in-depth understanding, they:

are very likely to misuse statistics

can't stray from the garden path

Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.

The following poem comes to mind:

A little learning is a dangerous thing;

Drink deep, or taste not the Pierian spring:

There shallow draughts intoxicate the brain,

And drinking largely sobers us again.

- Alexander Pope, A Little Learning

I would liken the difference between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in a WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."

What do you consider as must to know in statistics and machine learning?

The Kolmogorov axioms, the definition of a random variable (including random vectors, matrices, etc.), the algebra of random variables, the concept of a distribution and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems. If you want to know how to prove them (optional), you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.

This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.
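As a small illustration of the law of large numbers and the central limit theorem mentioned above, here is a minimal simulation sketch using only the Python standard library. The choice of Uniform(0, 1) draws, the sample size `n = 30`, the number of replicates and the seed are all arbitrary assumptions made for the example.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)  # fixed seed so the run is reproducible
n = 30                   # observations per sample (arbitrary choice)
replicates = 2000        # number of simulated samples (arbitrary choice)

# Each entry is the mean of n independent Uniform(0, 1) draws.
sample_means = [sum(rng.random() for _ in range(n)) / n for _ in range(replicates)]

# Theory for Uniform(0, 1): E[mean] = 1/2 and Var[mean] = (1/12) / n.
theory_sd = (1 / 12 / n) ** 0.5

# If the normal approximation is good, roughly 95% of the sample means
# should fall within 1.96 standard errors of 1/2.
inside = sum(1 for m in sample_means if abs(m - 0.5) <= 1.96 * theory_sd)
coverage = inside / replicates

print(f"mean of sample means:     {mean(sample_means):.4f}")
print(f"variance of sample means: {pvariance(sample_means):.5f} (theory {theory_sd ** 2:.5f})")
print(f"share within 1.96 SE:     {coverage:.3f}")
```

Swapping the uniform draws for a skewed distribution and varying `n` is a quick way to see how fast (or slowly) the approximation kicks in.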

There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!

At this point, you're ready to learn the basic structure of a hypothesis test; which is to say, what a "sample" is, and about null hypothesis and critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
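To make the structure of a hypothesis test concrete (sample, null hypothesis, test statistic, critical value), here is a minimal sketch of a two-proportion z-test of the kind used for a simple A/B test. The function name `two_proportion_z_test` and the conversion counts are made up for illustration, and the test relies on the usual large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test of H0: both groups share one success probability."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 the best estimate of the common probability pools both groups.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B counts: 200 of 2000 conversions vs. 250 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
critical = NormalDist().inv_cdf(0.975)  # two-sided critical value at alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {abs(z) > critical}")
```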

But you're not really done; in fact, we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE (maximum likelihood estimation). For me personally, this is the only reason why we developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.
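The idea of fitting as optimization can be sketched in a few lines. The example below fits a one-feature logistic regression by maximum likelihood, using plain gradient descent on the average negative log-likelihood. It is a toy under stated assumptions: the data are simulated from made-up parameters, the learning rate and step count are arbitrary, and real software typically uses Newton-type methods (such as iteratively reweighted least squares) instead.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def neg_log_likelihood(b0, b1, xs, ys):
    """Average negative log-likelihood of a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def fit_logistic(xs, ys, learning_rate=0.5, steps=2000):
    """Minimise the negative log-likelihood by plain gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # derivative of the loss w.r.t. the linear predictor
            g0 += err
            g1 += err * x
        b0 -= learning_rate * g0 / n
        b1 -= learning_rate * g1 / n
    return b0, b1


# Simulate data from known (made-up) parameters, then try to recover them.
rng = random.Random(0)
true_b0, true_b1 = -0.5, 1.5
xs = [rng.gauss(0, 1) for _ in range(400)]
ys = [1 if rng.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"estimated intercept {b0_hat:.2f}, slope {b1_hat:.2f}")
print(f"loss at estimate {neg_log_likelihood(b0_hat, b1_hat, xs, ys):.4f}")
```

The estimates will not equal the true parameters exactly; on this sample, though, the fitted values should give a loss no higher than the true parameters do, which is what "maximum likelihood" means.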

If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
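A typical power-analysis question is "how many subjects per group do I need?". As a rough sketch, the function below uses the textbook normal-approximation formula n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2 for a two-sided, two-sample comparison of means with equal group sizes. The function name `sample_size_per_group` is hypothetical, and an exact calculation based on the t distribution gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    effect_size is Cohen's d: the difference in means divided by the
    common standard deviation. Uses the normal approximation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Conventional "small", "medium" and "large" effect sizes.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {sample_size_per_group(d)} subjects per group")
```

Halving the effect size roughly quadruples the required sample, which is usually the first thing a client needs to hear.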

Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
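For readers who have not seen cross validation before, here is a minimal sketch of k-fold cross validation written from scratch. The helper names `k_fold_indices` and `cross_validated_mse` are hypothetical, the model is a simple least-squares line, and the data are simulated; in practice you would use the tools in caret or scikit-learn.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_mse(xs, ys, k=5):
    """Estimate the out-of-sample mean squared error of a least-squares line."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), k):
        # Fit slope and intercept on the training folds only.
        mean_x = sum(xs[i] for i in train) / len(train)
        mean_y = sum(ys[i] for i in train) / len(train)
        sxx = sum((xs[i] - mean_x) ** 2 for i in train)
        sxy = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in train)
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        # Score on the held-out fold, which the fit never saw.
        squared_errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test]
        fold_errors.append(sum(squared_errors) / len(test))
    return sum(fold_errors) / k


# Simulated data: a line with unit-variance Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]
print(f"5-fold CV estimate of MSE: {cross_validated_mse(xs, ys):.3f}")
```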

At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.
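To illustrate how adding a term to a loss function yields a new model, here is the standard progression from ordinary least squares to its penalized variants. The notation is an assumption made for this sketch: y_i is the response, x_i the feature vector, beta the coefficient vector, and lambda, lambda_1, lambda_2 are non-negative tuning parameters.

```latex
\begin{align*}
\text{OLS:} \quad
  & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 \\
\text{Ridge:} \quad
  & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2
    + \lambda \lVert \beta \rVert_2^2 \\
\text{Lasso:} \quad
  & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2
    + \lambda \lVert \beta \rVert_1 \\
\text{Elastic net:} \quad
  & \hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2
    + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{align*}
```

Each line differs from the first only by its penalty term, yet each is treated as a model in its own right with its own literature.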

What tests/ methods would you put in your toolbox?

All right, laundry list time! A lot of these come from The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to implement most of the most mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:

Ridge, Lasso, and ElasticNet Regression

Local Regression (LOESS)

Kernel Density Estimates

PCA

Factor Analysis

K-means

GMM (and other mixture models)

Decision Trees, Random Forest, and XGBoost

Time Series Analysis: ARIMA, possibly exponential smoothing

SVM (Support Vector Machines)

Hidden Markov Models

GAM (Generalized Additive Models)

Bayes Networks and Structural Equation Modeling

Robust Regression

Imputation

Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.

Bayesian Inference with MCMC a la Stan

Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)

Extreme value theory

Vapnik–Chervonenkis theory

Causality

Pairwise/Preference modeling, e.g. Bradley-Terry

IRT (item response theory, used for surveys and tests)

Martingales

Copulas

This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right into neural nets and ignore the rest. Some people (actuaries) spend their entire career focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.

Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap. — Apr 11 at 12:58
This post is a nice example of what a machine learning practitioner/data scientist thinks statistics is, and if one knows all the above well one can do some serious modelling. However, it is very far from what a statistician would say a statistician should know, or what statistics comprises (as in what is published in top journals). — Apr 12 at 8:38
@Forgottenscience yes, that's a fair assessment. The original question was asking about "statistics and machine learning" for "statistician/ ML professionals", after all. What are some of the top-of-mind subjects for research statisticians in 2019? — Apr 12 at 17:00

score 9 · Accepted Answer · 2019-04-11 09:38:56Z

Speaking from a professional perspective (not an academic one), and based on having interviewed several candidates and having been interviewed myself many times as well, I would argue that deep or wide knowledge of stats is not considered a "must know", but having a very solid grasp of the basics (linear regression, hypothesis testing, probability 101, etc.) is essential, as well as some basic knowledge of algorithms (merging/joining tables, dynamic programming, search methods, etc.). I would rather have someone who understands very well how to apply Bayes’ rule and who knows how to unit test a Python function than someone who can give me a fancy explanation of how Bayesian optimization works and has experience with TensorFlow, but doesn't seem to grasp the concept of conditional probability or how to sort an array.
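As a rough idea of the level being described, here is a sketch that combines the two basics named above: Bayes' rule for a binary hypothesis and a plain assertion-based unit test for it. The function name `posterior` and the numbers (1% prevalence, 90% sensitivity, 5% false-positive rate) are made up for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | positive result) via Bayes' rule for a binary hypothesis H.

    prior               = P(H)
    likelihood          = P(positive | H)
    false_positive_rate = P(positive | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


def test_posterior():
    # Hypothetical screening test: 1% prevalence, 90% sensitivity, 5% false positives.
    result = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
    assert abs(result - 0.009 / 0.0585) < 1e-12
    # An uninformative test (equal rates under H and not-H) must leave the prior unchanged.
    assert abs(posterior(0.3, 0.5, 0.5) - 0.3) < 1e-12


test_posterior()
print(f"P(condition | positive) = {posterior(0.01, 0.90, 0.05):.3f}")
```

The second assertion is the kind of sanity check that shows understanding of conditional probability and not just the formula.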

Beyond the basics, most good companies or teams will quiz you on what you claim you know, not what they think you should know. If you put SVM on your resume, make sure you truly understand SVM, and have some experience using it.

Also, good companies or teams will test your hands-on experience more than the depth of your theoretical knowledge.

It is incredibly unlikely that someone could explain in a fancy manner how Bayesian optimization works yet doesn't understand conditional probability or sorting an array. The theoretical knowledge helps guide a lot of applied problems where only applied knowledge keeps blinders on the horse and can limit quality of the work. — Apr 11 at 12:44
@LSC you'd be surprised. I have run into a lot of candidates who fit exactly that description. I think it's one of the main reasons more and more companies put people through a hands-on technical screening as the first step in their interview process. — Apr 11 at 14:19
I would be surprised. Although, I have met many "data scientists", "ML", and "big data"/"six sigma" people who use the big words/fancy methodology names but have no real statistical background and therefore don't understand the words they use or the big picture (like people using certain kinds of mixed models or LASSO but mistakenly believe a p-value is an error probability). I agree with exploring the depth of someone's claimed knowledge and experience. — Apr 11 at 14:44

score 6 · Accepted Answer · 2019-04-11 12:54:11Z

What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).

What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously

Say that statistics is hard

Admit that they have little training or expertise in it and

Do it on their own anyway.

No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.

What I need to know is

When I am out of my depth. No one knows all this stuff, certainly I don't.

A whole lot about models, methods and such, when each can be applied, what each does, how it goes wrong, alternatives etc.

Also, how to run these models in some statistical package and read the results, detect bugs etc. (I use SAS and R, but other choices are fine).

How to ask questions. A good data analyst asks a lot of questions.

Enough matrix algebra and calculus to at least read articles. But that's not all that much.

Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression; trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, a la Dilbert; this could be a committee chair, a journal editor, a colleague or an actual boss.)

score 2 · Accepted Answer · 2019-04-12 09:19:58Z

For a person doing work in statistics or doing work associated with statistics there is not really much clear must-know knowledge.

Obviously, people should be able to do simple and ordinary things, e.g. simple arithmetic.

But beyond that, statistics and machine learning is enormously broad and multidisciplinary. You might have a person doing only work writing SQL and managing databases, or a person collecting data for the state, e.g. stuff like Eurostat (statistics is etymologically derived from 'state'). Should those people know the Kolmogorov axioms, or should they know all types of Pearson distributions?

It is a bit like asking what tools a construction worker must be able to work with. An electrician is not like a carpenter, and a plumber is not like a plasterer. There is very little that they all must know, and it will only be a fraction of their abilities.

edited Apr 12 at 9:19

community wiki

3 revs
Martijn Weterings

@igoR87 I have correctly answered your question. This answer is "what I consider a person must know". If you believe that this is a useless answer, then it is at least useful in showing how incomplete and useless the question is. – Martijn Weterings, Apr 12 at 8:46
@igoR87 this is my serious answer. I did not post it at first because I find it not very useful. So, yes, you are right that my motivation to post this answer is different. But it is what would be my answer. The answer is serious, but posting the answer is not. – Martijn Weterings, Apr 12 at 8:57
It is because of these open possibilities to view your question that I find the question too vague. I cannot improve your question because I cannot know what you wanted it to look like more specifically. If this answer is not what you are looking for, then this helps you to understand what is wrong with your question, and you can improve it to make it more like your intentions (I do not know your intentions, I only know the question is vague). – Martijn Weterings, Apr 12 at 8:58
Posting this answer is how I try to work towards a better question. Like Peter Flom says, statisticians need to be good at asking and enquiring. Understand the problem/question the client has; don't just start editing it yourself based on (possibly false) assumptions. – Martijn Weterings, Apr 12 at 9:02
@igoR87, while my intention to answer may be weird, this is my true answer and this does convey "what I consider must know knowledge" (I believe there is none besides very simple basics like arithmetic). Besides that, indicating how a question is wrongly phrased or can be interpreted is not an uncommon answering style on stackexchange. – Martijn Weterings, Apr 12 at 9:04
