Cell Information Gain

For simplicity, it is assumed that every cell belongs to a row, and every individual is represented by exactly one row. The CIG quantifies the information gained by learning the value of a cell, given that one already knows all the other cell values of this individual.

We use entropy to measure information. The entropy of a random variable is the average level of uncertainty inherent in the variable’s possible outcomes.
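Concretely, for a discrete distribution p the Shannon entropy is H(p) = -Σᵢ pᵢ log₂ pᵢ, measured in bits. A minimal sketch in Python (the `entropy` helper is our own, for illustration):

```python
import math

def entropy(p):
    # Shannon entropy (in bits) of a discrete distribution p;
    # zero-probability outcomes are skipped, taking 0 * log(0) = 0
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.51, 0.49]))  # ~0.9997 bits: a near-even split is close to maximal uncertainty
print(entropy([1.0, 0.0]))    # 0.0 bits: a certain outcome carries no uncertainty
```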

As the attacker already has an expectation of that variable’s distribution (the prior), we define the CIG as the Kullback–Leibler (KL) divergence, also known as relative entropy, between the posterior and the prior distribution.
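A minimal sketch of this definition in Python (our own helper, not from any library; it assumes both distributions list probabilities for the same outcomes in the same order, and that the prior is strictly positive wherever the posterior is):

```python
import math

def cig(posterior, prior):
    # KL divergence D(posterior || prior) in bits: the information
    # gained by updating from the prior belief to the posterior
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)
```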

Example

Consider the following dataset:

| Gender | Eye color | Occupation |
|--------|-----------|------------|
| male   | blue      | dentist    |
| female | blue      | dentist    |
| male   | green     | accountant |
| male   | green     | accountant |

For the sake of exposition, we focus on the feature ‘Gender’.

First, we need the prior distribution of ‘Gender’. There are 51% males and 49% females in the Australian population, so 0.51 and 0.49 form the prior distribution.

| Gender | Prior Distribution |
|--------|--------------------|
| male   | 0.51               |
| female | 0.49               |

Knowing the values of ‘Eye color’ and ‘Occupation’ provides context. The posterior distribution of ‘Gender’ is its conditional distribution given this context.

The posterior of ‘Gender’ is given by P(Gender|Eye color, Occupation) as follows:

| Eye color | Occupation | P(Gender=male \| Eye color, Occupation) | P(Gender=female \| Eye color, Occupation) |
|-----------|------------|-----------------------------------------|-------------------------------------------|
| blue      | dentist    | 0.5                                     | 0.5                                       |
| green     | accountant | 1                                       | 0                                         |
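These posteriors are simply the empirical conditional frequencies within each cohort of the four-row dataset above, as a minimal sketch in plain Python shows (the `posterior` helper is ours, for illustration):

```python
from collections import Counter

rows = [
    ("male", "blue", "dentist"),
    ("female", "blue", "dentist"),
    ("male", "green", "accountant"),
    ("male", "green", "accountant"),
]

def posterior(eye, job):
    # Empirical P(Gender | Eye color, Occupation) for one cohort
    cohort = [gender for gender, e, o in rows if e == eye and o == job]
    counts = Counter(cohort)
    return {g: counts[g] / len(cohort) for g in ("male", "female")}

print(posterior("blue", "dentist"))      # {'male': 0.5, 'female': 0.5}
print(posterior("green", "accountant"))  # {'male': 1.0, 'female': 0.0}
```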

We can see that the posterior distribution for the blue-eyed dentists is essentially the same as the population prior, so we associate little risk with the ‘Gender’ values of this cohort: learning one of these cells tells the attacker almost nothing new. The posterior distribution of ‘Gender’ for the green-eyed accountants, on the other hand, differs significantly from the prior, so there is much more information to be gained.
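Plugging these distributions into the KL-divergence sketch from above makes the difference concrete (we measure in bits):

```python
import math

def cig(posterior, prior):
    # KL divergence D(posterior || prior) in bits, as sketched earlier
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)

prior = [0.51, 0.49]
print(cig([0.5, 0.5], prior))  # ~0.0003 bits: blue-eyed dentists' 'Gender' cells reveal almost nothing
print(cig([1.0, 0.0], prior))  # ~0.97 bits: green-eyed accountants' 'Gender' cells reveal much more
```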