How machine learning helps cancer research

Evelina Gabasova

@evelgab

MRC Cancer Unit, University of Cambridge

DNA

Cost of whole genome sequencing

Sequencing data

DNA and genes



Cancer

  • Genetic mutations
  • Oncogenes and tumour suppressors


BRCA1 and BRCA2 are chromosome guardians



  • Cancer is not a single disease

Clustering

Example

Clustering wholesale customers

440 wholesale customers

Annual spending on

  • Fresh produce
  • Milk products
  • Grocery products
  • Frozen products
  • Detergents and paper
  • Delicatessen

Methods for clustering

  • k-means clustering
  • hierarchical clustering
  • spectral clustering
  • Gaussian mixture model and other probabilistic methods
  • ...
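
As a rough illustration of the first method in the list, here is a minimal k-means sketch (Lloyd's algorithm) using only the standard library. The six customers and their two spending figures are made up for illustration and are not taken from the 440-customer dataset; the function names are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return assign(points, centroids), centroids

# Made-up annual spending (fresh produce, detergents and paper), arbitrary units.
customers = [(12.0, 2.0), (11.0, 3.0), (13.0, 2.5),
             (2.0, 9.0), (3.0, 10.0), (2.5, 11.0)]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

The first three customers should end up in one cluster and the last three in the other; which cluster gets which index depends on the random starting centroids.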

Visualisation of high-dimensional data

Principal component analysis
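
One way to see what principal component analysis does is to compute the first component by hand. The sketch below centres the data, builds the covariance matrix, and finds its dominant eigenvector with power iteration. The five three-dimensional samples are made up, and power iteration is just one convenient way to get the first component; library implementations typically use a singular value decomposition instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centred sample onto the component to get a 1-D coordinate.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores

# Made-up samples that vary mostly along the direction where x and y grow together.
samples = [(1.0, 1.0, 0.1), (2.0, 2.1, -0.1), (3.0, 2.9, 0.0),
           (4.0, 4.2, 0.1), (5.0, 4.9, -0.1)]
component, scores = first_principal_component(samples)
print(component)
print(scores)
```

Plotting the scores of the first two or three components is the usual way to get a low-dimensional picture of high-dimensional data.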

Clustering cancer data

Genes instead of customers

Gene expression instead of spending on products

Conventional medicine

Precision medicine

Clustering in cancer research

TCGA breast cancer
Clustering 368 tumour samples based on expression of 648 genes.

Integrative clustering

Collaborative filtering

Example

The Netflix prize

User    Film 1  Film 2  Film 3  Film 4  ...  Film 1000
Alice   5       2       x       x       ...  4
Bob     x       1       2       x       ...  2
Carol   2       x       3       ?       ...  3
...     ...     ...     ...     ...     ...  ...
Zoe     x       5       4       5       ...  x

Matrix factorization
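
A minimal sketch of matrix factorization on the visible entries of the ratings table: each user and each film gets a short vector of latent features, and a rating is approximated by their dot product. The rank, learning rate, regularisation strength and the use of plain stochastic gradient descent are all assumptions made for illustration; the slides do not say which variant was meant.

```python
import math
import random

# Visible entries of the ratings table; None marks an unrated film.
films = ["Film 1", "Film 2", "Film 3", "Film 4", "Film 1000"]
ratings = {
    "Alice": [5, 2, None, None, 4],
    "Bob":   [None, 1, 2, None, 2],
    "Carol": [2, None, 3, None, 3],   # Film 4 is the "?" we want to predict
    "Zoe":   [None, 5, 4, 5, None],
}

def factorize(ratings, n_items, rank=2, lr=0.02, reg=0.02, epochs=5000, seed=0):
    """Fit user vectors P and film vectors Q to the observed ratings with SGD."""
    rng = random.Random(seed)
    P = {u: [rng.random() for _ in range(rank)] for u in ratings}
    Q = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    observed = [(u, i, r) for u, row in ratings.items()
                for i, r in enumerate(row) if r is not None]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(rank):
                p, q = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularisation.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q, observed

def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and film feature vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))

P, Q, observed = factorize(ratings, n_items=len(films))
rmse = math.sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in observed)
                 / len(observed))
carol_film4 = predict(P, Q, "Carol", films.index("Film 4"))
print("training RMSE:", rmse)
print("predicted rating, Carol / Film 4:", carol_film4)
```

With only a handful of ratings the prediction is not meaningful in itself; the point is that the missing entry is filled in from the learned features rather than from any single row or column.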

Collaborative filtering of cancer data

Patients instead of users

DNA mutations instead of film ratings

Mutational signatures

Patient  C/A  C/G  C/T  T/A  T/C  T/G
Alice    5    2    0    0    3    4
Bob      0    1    2    0    0    2
Carol    2    0    3    0    1    3
...      ...  ...  ...  ...  ...  ...
Zoe      0    5    4    5    2    0

Matrix factorization to identify features
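
The sketch below factorizes the patient-by-mutation count table into two non-negative matrices: W (how much of each feature a patient carries) and H (the mutation profile of each feature). Non-negative matrix factorization with multiplicative updates is assumed here because counts cannot be negative; the slides only say "matrix factorization", and the choice of two features is arbitrary.

```python
import random

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Sum of squared differences between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iterations=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF: V is approximated by W H with W, H >= 0."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(rank)]
    initial_error = frobenius_error(V, W, H)
    for _ in range(iterations):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H, initial_error

# Rows: Alice, Bob, Carol, Zoe. Columns: C/A, C/G, C/T, T/A, T/C, T/G.
counts = [[5, 2, 0, 0, 3, 4],
          [0, 1, 2, 0, 0, 2],
          [2, 0, 3, 0, 1, 3],
          [0, 5, 4, 5, 2, 0]]
W, H, initial_error = nmf(counts, rank=2)
final_error = frobenius_error(counts, W, H)
print("features (rows of H):", H)
print("per-patient weights (rows of W):", W)
```

Each row of H can be read as a candidate mutational signature, and each row of W as how strongly a patient shows each one.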

Proving system stability

Theorem proving

SAT

(A ∨ ¬B) ∧ (¬A ∨ B)

A = true
B = true
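
For a formula this small, satisfiability can be checked by trying every assignment. The sketch below does exactly that; real SAT solvers use far more efficient search, so this is only meant to show what "satisfiable" means.

```python
from itertools import product

def formula(A, B):
    """(A or not B) and (not A or B)"""
    return (A or not B) and (not A or B)

# Try all four truth assignments and keep the ones that satisfy the formula.
solutions = [(A, B) for A, B in product([True, False], repeat=2) if formula(A, B)]
print(solutions)  # [(True, True), (False, False)]
```

The formula holds exactly when A and B have the same value, so A = true, B = true is one of two satisfying assignments.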

Theorem proving

Satisfiability Modulo Theories (SMT)

(A ∨ ¬B) ∧ (¬A ∨ B)

((a > 3) ∨ (b < 1)) ∧ ((a < 5) ∨ (b = 0))

a = 4
b = 0
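
The assignment a = 4, b = 0 can be checked directly, and a search over a small range of integers shows that it is one of many models. This brute-force search is only a stand-in: an SMT solver such as Z3 reasons about the arithmetic symbolically rather than enumerating values, and the range of -10 to 10 is an arbitrary choice for illustration.

```python
def formula(a, b):
    """((a > 3) or (b < 1)) and ((a < 5) or (b == 0))"""
    return (a > 3 or b < 1) and (a < 5 or b == 0)

# Check the assignment given on the slide.
slide_model_ok = formula(4, 0)

# Enumerate every model with a and b between -10 and 10.
models = [(a, b) for a in range(-10, 11) for b in range(-10, 11) if formula(a, b)]
print(slide_model_ok, len(models))
```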

Software verification

Z3 theorem prover

  • Preconditions
  • Postconditions
  • Loop conditions
  • Assertions
  • ...

→ SMT formulas

Software verification

Spec#

Proving stability of biological processes

  • Proteins
  • Genes
  • Receptors

→ Variables

v ← v + 1   if v < T(v)
    v       if v = T(v)
    v - 1   if v > T(v)
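
The update rule can be tried out on a toy network. In the sketch below every variable moves one step towards its target T(v) at each tick, and the network is called stable if every starting state ends in the same fixed point. The two-variable network, its target functions and the four value levels are hypothetical, and the check simulates every start state exhaustively rather than proving stability with an SMT solver.

```python
from itertools import product

def step(state, targets):
    """Apply the update rule once to every variable, synchronously."""
    new = []
    for v, target in zip(state, targets):
        t = target(state)
        new.append(v + 1 if v < t else v - 1 if v > t else v)
    return tuple(new)

def end_states(targets, levels):
    """Follow every start state until it repeats; record where it ends up."""
    ends = set()
    for start in product(range(levels), repeat=len(targets)):
        seen, state = set(), start
        while state not in seen:
            seen.add(state)
            state = step(state, targets)
        # state now lies on a cycle; a fixed point is a cycle of length one
        ends.add((state, step(state, targets) == state))
    return ends

# Hypothetical network: x is driven towards level 2, y follows x.
targets = [lambda s: 2, lambda s: s[0]]
ends = end_states(targets, levels=4)
stable = len(ends) == 1 and all(is_fixed for _, is_fixed in ends)
print(ends, stable)
```

Here every one of the 16 start states settles at (2, 2), so this toy network counts as stable under that definition.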

Bio Model Analyser

Chronic myeloid leukemia

Proving stability of biological systems

Chronic myeloid leukemia

Machine learning is not just for targeted advertising or algorithmic trading

@evelgab
evelina@evelinag.com
github.com/evelinag
evelinag.com

Links