Cell Information Gain
For simplicity, it is assumed that every cell belongs to a row, and every individual is represented by exactly one row. The CIG quantifies the information gained by learning the value of a cell, given that one already knows all the other cell values of this individual.
We use entropy to measure information. The entropy of a random variable is the average level of uncertainty inherent in the variable’s possible outcomes.
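As a minimal sketch of the entropy measure described above, the following computes the Shannon entropy of a discrete distribution. The function name `entropy` and the choice of base-2 logarithms (bits) are assumptions made for illustration; the text does not fix a unit.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution (in bits when base=2)."""
    # Outcomes with probability 0 are skipped: they contribute nothing,
    # since p * log(p) tends to 0 as p tends to 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

fair = entropy([0.5, 0.5])       # maximal uncertainty for two outcomes
certain = entropy([1.0, 0.0])    # no uncertainty at all
```

A uniform two-outcome distribution has the highest possible entropy (one bit), while a distribution that puts all its mass on one outcome has entropy zero.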
As the attacker already has an expectation of the distribution (prior) of that variable, we define the CIG as the change in entropy (or KL-divergence) between the prior and posterior distributions.
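One common way to write the two quantities mentioned above is sketched below. The symbols are assumptions introduced here: $P$ is the prior, $Q$ is the posterior given the other cell values of the row, and logarithms are taken in base 2. The text does not state the direction of the divergence; the posterior-relative-to-prior form is assumed.

```latex
% Change in entropy between prior P and posterior Q
\Delta H = H(P) - H(Q),
\qquad
H(P) = -\sum_{x} P(x) \log_2 P(x)

% KL-divergence of the posterior Q from the prior P
D_{\mathrm{KL}}(Q \,\|\, P) = \sum_{x} Q(x) \log_2 \frac{Q(x)}{P(x)}
```

Both quantities are zero when the posterior equals the prior, that is, when the context tells the attacker nothing new about the cell.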
Example
Consider the following dataset:

Gender | Eye color | Occupation |
---|---|---|
male | blue | dentist |
female | blue | dentist |
male | green | accountant |
male | green | accountant |
For the sake of exposition, we focus on the feature ‘Gender’.
First, we need the prior distribution of ‘Gender’. The Australian population is 51% male and 49% female, so 0.51 and 0.49 form the prior distribution.
Gender | Prior probability |
---|---|
male | 0.51 |
female | 0.49 |
Knowing values for ‘Eye color’ and ‘Occupation’ gives context. The posterior distribution of ‘Gender’ is the conditional distribution given its context.
The posterior of ‘Gender’ is given by P(Gender|Eye color, Occupation) as follows:
Eye color | Occupation | P(Gender=male|Eye color, Occupation) | P(Gender=female|Eye color, Occupation) |
---|---|---|---|
blue | dentist | 0.5 | 0.5 |
green | accountant | 1 | 0 |
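The posterior table can be reproduced by counting rows per context. The sketch below assumes the posterior is estimated empirically from the dataset itself; the names `rows` and `gender_posterior` are hypothetical and chosen for this example only.

```python
from collections import Counter, defaultdict

# The four rows of the example dataset: (gender, eye color, occupation).
rows = [
    ("male", "blue", "dentist"),
    ("female", "blue", "dentist"),
    ("male", "green", "accountant"),
    ("male", "green", "accountant"),
]

def gender_posterior(rows):
    """Empirical P(Gender | Eye color, Occupation) from row counts."""
    genders = sorted({gender for gender, _, _ in rows})
    counts = defaultdict(Counter)
    for gender, eye, job in rows:
        counts[(eye, job)][gender] += 1
    # Normalise the counts within each context; a Counter returns 0 for
    # genders that never occur in that context.
    return {
        context: {g: c[g] / sum(c.values()) for g in genders}
        for context, c in counts.items()
    }

posterior = gender_posterior(rows)
```

Here `posterior[("blue", "dentist")]` splits evenly between the two genders, while `posterior[("green", "accountant")]` puts all its mass on male, matching the table.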
We can see that the posterior distribution for the blue-eyed dentists (0.5/0.5) is very close to the population prior (0.51/0.49), so we associate little risk with ‘Gender’ values for this cohort. The posterior distribution of ‘Gender’ for the green-eyed accountants, on the other hand, differs substantially from the prior: every row in that cohort is male. Thus there is more information to be gained.
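To put numbers on this comparison, the sketch below computes the KL-divergence of each cohort's posterior from the prior. It assumes the divergence is taken from posterior to prior and measured in bits; neither choice is stated in the text, and the function name `kl_divergence` is hypothetical.

```python
import math

def kl_divergence(posterior, prior, base=2.0):
    """D_KL(posterior || prior) for discrete distributions given as dicts."""
    # Outcomes with posterior probability 0 contribute nothing to the sum.
    return sum(
        q * math.log(q / prior[k], base)
        for k, q in posterior.items()
        if q > 0
    )

prior = {"male": 0.51, "female": 0.49}
blue_dentist = {"male": 0.5, "female": 0.5}
green_accountant = {"male": 1.0, "female": 0.0}

cig_blue = kl_divergence(blue_dentist, prior)
cig_green = kl_divergence(green_accountant, prior)
```

Under these assumptions the blue-eyed dentists come out at roughly 0.0003 bits and the green-eyed accountants at roughly 0.97 bits, which matches the qualitative reading above: the second cohort's context reveals far more about ‘Gender’.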