The Mysterious Correlation
A detective story
Evelina Gabašová
@evelgab
Correlation is not causation!
Correlation and causation
Everyone loves a nice correlation!
Causal language
for correlation data
Things to check
- correlation or causation (randomised study?)
- small sample size
- result by chance
- hidden cause (latent variable)
Sample size?
<= 5 years |
4508 |
6106 |
2468 |
6-10 |
3917 |
3210 |
1386 |
11-15 |
1471 |
1052 |
445 |
15+ |
2080 |
1291 |
723 |
Sample size
Do storks deliver babies?
Source: Claude Covo-Farchi
Matthews, R. (2000), Storks Deliver Babies (p= 0.008). Teaching Statistics, 22: 36–38.
Correlation and causation
- Randomised experiments
- A/B testing
Correlation and causation
from observational data
Possible, but we need a lot of assumptions
- know all the variables
- know the right model
do-calculus
Predicting salary with linear regression
- Country
- Years of programming experience
- Tabs and spaces usage
- Developer type and language
- Level of formal education (e.g. bachelor’s, master’s, doctorate)
- Whether they contribute to open source
- Whether they program as a hobby
- Company size
What happens if we remove Tabs and Spaces?
Diving deeper into linear regression
- Full model with the information on tabs and spaces included
- Reduced model without the information on tabs and spaces
Coefficient of determination
how much variance in salary can the model explain
Full model |
0.4008 |
0.3892 |
Reduced model |
0.3938 |
0.3892 |
What changed in the reduced model?
More significant in the reduced model
- Years of programming experience
- Contributing to open source
- PHP
Open source contributors use spaces more than tabs
Language effects?
Language effects?
Language and open source?
Tabs, spaces, open source & salary
How does it fit together?
Exploring salary distributions
Based on experience level
What’s different for these users?
… more statistical testing
The importance of version control
Git |
168 |
660 |
I use some other system |
17 |
30 |
Subversion |
4 |
47 |
Team Foundation Server |
6 |
92 |
Version control and tabs/spaces
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 258.48, df = 18, p-value < 2.2e-16
Version control and salary
Git and Subversion
Why is version control so important?
But can we trust the data?
Salary distribution
Missing data
Statistics of missing data
- missing completely at random
- missing at random
- missing not at random
Missing completely at random
Missing at random
Missing not at random
Missing data on salaries
typically missing at random or missing not at random
Data traps
accidentally or deliberately
Interpretation
Machine learning as a service
Machine learning
as learning-by-association
“As much as I look into what’s being done with deep learning, I see they’re all stuck there on the level of associations. Curve fitting.”
Judea Pearl
Beyond “Correlation is not causation”
Evelina Gabasova
Consulting data detective
@evelgab
evelinag.com