The Mysterious Correlation
A detective story
Evelina Gabašová
@evelgab

Correlation is not causation!
Correlation and causation
Everyone loves a nice correlation!
Causal language
for correlation data
Things to check
- correlation or causation (randomised study?)
- small sample size
- result by chance
- hidden cause (latent variable)
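A minimal sketch of the last item, the hidden cause. The data are simulated, not taken from any survey: `simulate_hidden_cause`, `pearson` and the variables `x`, `y`, `z` are hypothetical names for illustration. Both `x` and `y` depend on the hidden `z` and never on each other, yet they end up clearly correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_hidden_cause(n=5000, seed=0):
    """Simulate x and y that are both driven by a hidden variable z.

    Assumed setup for illustration: x and y never influence each other.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + rng.gauss(0, 1) for zi in z]
    return z, x, y


z, x, y = simulate_hidden_cause()

# Raw correlation: clearly positive, although x does not cause y.
r_raw = pearson(x, y)

# Subtract the hidden cause's contribution from both variables;
# what is left is independent noise, so the association disappears.
r_adjusted = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"raw: {r_raw:.2f}, adjusted for z: {r_adjusted:.2f}")
```

In real data the hidden variable is usually not measured, which is exactly why it is easy to miss.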
Sample size?
| <= 5 years | 4508 | 6106 | 2468 |
| 6-10 | 3917 | 3210 | 1386 |
| 11-15 | 1471 | 1052 | 445 |
| 15+ | 2080 | 1291 | 723 |
Sample size

Do storks deliver babies?
Source: Claude Covo-Farchi
Matthews, R. (2000), Storks Deliver Babies (p = 0.008). Teaching Statistics, 22: 36–38.
Correlation and causation
- Randomised experiments
- A/B testing
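Why randomisation helps, as a rough simulation. Everything here is an assumption made for illustration: the outcome model, the hidden trait `z`, and the function name `estimated_effect`. The true effect of the treatment is fixed at 1.0. When people choose the treatment themselves, the estimate is far off; when a coin flip decides, it lands close to the true value.

```python
import random


def mean(values):
    return sum(values) / len(values)


def estimated_effect(randomised, n=20000, true_effect=1.0, seed=1):
    """Difference in mean outcome between treated and untreated units.

    Assumed model: outcome = true_effect * treated + 2 * z + noise,
    where z is a hidden trait. Without randomisation, units with
    high z pick the treatment themselves.
    """
    rng = random.Random(seed)
    treated_outcomes, control_outcomes = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        if randomised:
            treated = rng.random() < 0.5  # coin flip, independent of z
        else:
            treated = z > 0  # self-selection driven by the hidden trait
        outcome = true_effect * treated + 2 * z + rng.gauss(0, 1)
        if treated:
            treated_outcomes.append(outcome)
        else:
            control_outcomes.append(outcome)
    return mean(treated_outcomes) - mean(control_outcomes)


observational = estimated_effect(randomised=False)
experimental = estimated_effect(randomised=True)
print(f"observational estimate: {observational:.2f}")
print(f"randomised estimate:    {experimental:.2f}")
```

An A/B test is the same idea in practice: the assignment to A or B is random, so hidden traits are balanced across both groups on average.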
Correlation and causation
from observational data
Possible, but we need a lot of assumptions
- know all the variables
- know the right model
do-calculus
Predicting salary with linear regression
- Country
- Years of programming experience
- Tabs and spaces usage
- Developer type and language
- Level of formal education (e.g. bachelor’s, master’s, doctorate)
- Whether they contribute to open source
- Whether they program as a hobby
- Company size
What happens if we remove Tabs and Spaces?
Diving deeper into linear regression
- Full model with the information on tabs and spaces included
- Reduced model without the information on tabs and spaces
Coefficient of determination
how much variance in salary can the model explain
| Full model | 0.4008 | 0.3892 |
| Reduced model | 0.3938 | 0.3892 |
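For reference, a small sketch of how the coefficient of determination is computed from actual and predicted values. The numbers are toy values, not the survey data. The adjusted variant is included as a common companion when comparing a full and a reduced model, because plain R² can only drop when a predictor is removed; the slide does not say which quantity each column reports, so treat that pairing as an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R^2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Toy numbers for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

r2 = r_squared(actual, predicted)
r2_adj = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=1)
print(f"R^2 = {r2:.2f}, adjusted R^2 = {r2_adj:.2f}")
```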
What changed in the reduced model?
More significant in the reduced model
- Years of programming experience
- Contributing to open source
- PHP
Open source contributors use spaces more than tabs

Language effects?


Language and open source?
Tabs, spaces, open source & salary
How does it fit together?

Exploring salary distributions
Based on experience level

What’s different for these users?
… more statistical testing
The importance of version control
| Git | 168 | 660 |
| I use some other system | 17 | 30 |
| Subversion | 4 | 47 |
| Team Foundation Server | 6 | 92 |
Version control and tabs/spaces
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 258.48, df = 18, p-value < 2.2e-16
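What the test statistic measures, as a hand-rolled sketch. It compares the observed counts with the counts expected if rows and columns were independent. The function name `chi_squared_statistic` and the 2x2 table are made up for illustration; this does not reproduce the output above, which comes from a larger table (df = 18), and it leaves out the p-value.

```python
def chi_squared_statistic(table):
    """Pearson's chi-squared statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy 2x2 table, not the survey counts.
stat, dof = chi_squared_statistic([[10, 20], [20, 10]])
print(f"X-squared = {stat:.2f}, df = {dof}")
```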
Version control and salary

Git and Subversion

Why is version control so important?
But can we trust the data?
Salary distribution

Missing data

Statistics of missing data
- missing completely at random
- missing at random
- missing not at random
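The three mechanisms can be told apart by what the chance of going missing depends on. A simulated sketch, with a made-up population (salary grows with experience, arbitrary units, true mean salary 50) and a hypothetical function name `observed_mean_salary`:

```python
import random


def observed_mean_salary(mechanism, n=20000, seed=2):
    """Mean of the salaries that remain after some values go missing.

    Assumed population: salary = 30 + 2 * experience + noise,
    with experience uniform between 0 and 20 years.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        experience = rng.uniform(0, 20)
        salary = 30 + 2 * experience + rng.gauss(0, 5)
        if mechanism == "MCAR":
            # Missing completely at random: same chance for everyone.
            p_missing = 0.3
        elif mechanism == "MAR":
            # Missing at random: chance depends on an observed variable.
            p_missing = 0.6 if experience > 10 else 0.1
        elif mechanism == "MNAR":
            # Missing not at random: chance depends on the hidden value itself.
            p_missing = 0.6 if salary > 50 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() >= p_missing:
            observed.append(salary)
    return sum(observed) / len(observed)


for name in ("MCAR", "MAR", "MNAR"):
    print(name, round(observed_mean_salary(name), 1))
```

In this setup only MCAR leaves the plain average close to the true mean. Under MAR the bias can in principle be corrected using the observed variable (here, experience); under MNAR it cannot be fixed from the observed data alone.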
Missing completely at random
Missing at random
Missing not at random
Missing data on salaries
typically missing at random or missing not at random
Data traps
accidentally or deliberately
Interpretation
Machine learning as a service
Machine learning
as learning-by-association
“As much as I look into what’s being done with deep learning, I see they’re all stuck there on the level of associations. Curve fitting.”
Judea Pearl
Beyond “Correlation is not causation”

Evelina Gabasova
Consulting data detective
@evelgab
evelinag.com