Personal Information Factor (PIF) - Computing the cell information gain (CIG)¶

The PIF tries to answer the question:

“Knowing everything about a person but one feature’s value, what’s the information one would gain learning that value?”

The information gain is a measure of how unexpected the value is. The higher the information gain, the more unusual the value is, given the values for all remaining features.

We compute the information gain as the KL-divergence between the distribution of values of the whole dataset (the features’ priors) and the distribution of a feature’s values given all remaining features’ values (posterior).
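Written out, the definition above can be sketched as follows. The notation is an assumption introduced here for illustration: $x_{i,\neg f}$ stands for the values of all features except $f$ in row $i$, and $v$ ranges over the values of $f$. The base-2 logarithm and the direction of the divergence (posterior relative to prior) are also assumptions; they are consistent with the toy values shown later in this tutorial, but the text does not state them explicitly.

```latex
\mathrm{CIG}(i, f)
  = D_{\mathrm{KL}}\big( P(f \mid x_{i,\neg f}) \,\|\, P(f) \big)
  = \sum_{v} P(v \mid x_{i,\neg f}) \, \log_2 \frac{P(v \mid x_{i,\neg f})}{P(v)}
```

Values $v$ with zero posterior probability contribute nothing to the sum.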

[12]:

import collections
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
import seaborn as sns

import piflib

from tutorial_helpers import horizontal_bar_plot

Example dataset¶

We define a toy dataset to explain the process. Feel free to modify the dataset and examine the behaviour of the corresponding CIG values.

[13]:

data = {'gender': (['male'] * 6)+['female'],
        'name': ['Anton', 'Bill', 'Charlie', 'Don', 'Emil', 'Emil', 'Charlie'],
        'eye_color': ['blue', 'green', 'green', 'green', 'blue', 'green', 'green']}
df = pd.DataFrame(data)
df

[13]:

	gender	name	eye_color
0	male	Anton	blue
1	male	Bill	green
2	male	Charlie	green
3	male	Don	green
4	male	Emil	blue
5	male	Emil	green
6	female	Charlie	green

The features’ priors¶

Looking at the dataset as a whole, what is the distribution of the values of each feature? Without any information about a person, this is what we expect them to look like.

[14]:

from piflib.data_util import calculate_distribution

[15]:

for feature in df.columns:
    dist = calculate_distribution(df[feature])
    horizontal_bar_plot({'': list(dist.values())}, dist.keys())
    plt.title(f'Prior distribution of: {feature}')

../_images/tutorials_tutorial_cig_6_0.png

../_images/tutorials_tutorial_cig_6_1.png

../_images/tutorials_tutorial_cig_6_2.png
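The implementation of `calculate_distribution` is not shown in this tutorial. As a rough mental model, a prior can be thought of as the relative frequency of each value in a column. The helper `relative_frequencies` below is a hypothetical stand-in written for illustration, not piflib's code.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of all values (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# The eye_color column of the toy dataset.
eye_color = ['blue', 'green', 'green', 'green', 'blue', 'green', 'green']
prior = relative_frequencies(eye_color)
print({value: round(share, 2) for value, share in prior.items()})
# {'blue': 0.29, 'green': 0.71}
```

These are the 29/71 shares for blue and green eyes that the discussion of the CIG values refers to.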

The posterior distributions¶

To compute a feature’s posterior distribution, we have to take its context into account. For example, in the given dataset there are three posterior distributions for the feature ‘name’, one for each combination of gender and eye color that occurs: “male, blue”, “male, green” and “female, green”.

[16]:

for feature in df.columns:
    # All remaining columns form the known context of the unknown feature.
    known_features = tuple(col_name for col_name in df.columns if col_name != feature)
    # Group the unknown feature's values by their context (one bucket per cohort).
    bucket = collections.defaultdict(list)
    for idx, row in df.iterrows():
        key = tuple(row[known_feature] for known_feature in known_features)
        bucket[key].append(row[feature])

    # One posterior distribution per cohort.
    bucket_distributions = {key: calculate_distribution(el_bucket) for key, el_bucket in bucket.items()}
    feature_vals = df[feature].unique()
    # Align every posterior to the same value order; values absent from a cohort get probability 0.
    dists = {}
    for key, distribution in bucket_distributions.items():
        dists[str(key)] = [distribution.get(feature_val, 0) for feature_val in feature_vals]

    horizontal_bar_plot(dists, feature_vals)
    plt.title(f'unknown feature: {feature}')
    plt.ylabel('known other feature values')
    plt.show()

../_images/tutorials_tutorial_cig_8_0.png

../_images/tutorials_tutorial_cig_8_1.png

../_images/tutorials_tutorial_cig_8_2.png
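To make the notion of a cohort concrete, the following stdlib-only sketch groups the ‘name’ values of the toy dataset by their context (gender, eye_color). The variable names are chosen here for illustration only.

```python
from collections import defaultdict

# The toy dataset as (gender, name, eye_color) tuples.
rows = [
    ('male', 'Anton', 'blue'),
    ('male', 'Bill', 'green'),
    ('male', 'Charlie', 'green'),
    ('male', 'Don', 'green'),
    ('male', 'Emil', 'blue'),
    ('male', 'Emil', 'green'),
    ('female', 'Charlie', 'green'),
]

# Group the 'name' values by the known context (gender, eye_color).
cohorts = defaultdict(list)
for gender, name, eye_color in rows:
    cohorts[(gender, eye_color)].append(name)

for context, names in cohorts.items():
    print(context, names)
# ('male', 'blue') ['Anton', 'Emil']
# ('male', 'green') ['Bill', 'Charlie', 'Don', 'Emil']
# ('female', 'green') ['Charlie']
```

Each cohort yields one posterior distribution for ‘name’. Rows that share a context also share a posterior.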

The CIG values:¶

Given a feature’s prior and posterior distributions, one can compute the KL-divergence between the two for each cell in the dataset. This forms the CIG value.
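The whole computation can be sketched without any dependencies. This is an illustration, not piflib's implementation. It assumes that distributions are relative frequencies, that the logarithm is base 2, and that the divergence is taken from the posterior relative to the prior. All function names are hypothetical. Under these assumptions the sketch reproduces the rounded CIG values that this tutorial reports for the toy dataset.

```python
import math
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def kl_divergence(posterior, prior):
    """KL(posterior || prior) in bits; zero-probability terms contribute 0."""
    return sum(p * math.log2(p / prior[v]) for v, p in posterior.items() if p > 0)


def cell_information_gains(records, features):
    """Return one dict of CIG values per record (illustrative sketch)."""
    cigs = [dict() for _ in records]
    for feature in features:
        known = [f for f in features if f != feature]
        prior = distribution(r[feature] for r in records)
        # Group the feature's values by the values of all other features.
        cohorts = defaultdict(list)
        for r in records:
            cohorts[tuple(r[f] for f in known)].append(r[feature])
        posteriors = {key: distribution(vals) for key, vals in cohorts.items()}
        for i, r in enumerate(records):
            key = tuple(r[f] for f in known)
            cigs[i][feature] = kl_divergence(posteriors[key], prior)
    return cigs


genders = ['male'] * 6 + ['female']
names = ['Anton', 'Bill', 'Charlie', 'Don', 'Emil', 'Emil', 'Charlie']
eye_colors = ['blue', 'green', 'green', 'green', 'blue', 'green', 'green']
records = [dict(gender=g, name=n, eye_color=e)
           for g, n, e in zip(genders, names, eye_colors)]
features = ['gender', 'name', 'eye_color']

cigs = cell_information_gains(records, features)
for i, row in enumerate(cigs):
    print(i, {f: round(v, 2) for f, v in row.items()})
```

Every value in a cohort is also present in the prior, because the prior is computed over the whole column, so the lookup `prior[v]` cannot fail.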

[17]:

piflib.compute_cigs(df).round(2)

[17]:

	gender	name	eye_color
0	0.22	1.31	1.81
1	0.22	0.31	0.49
2	0.51	0.31	0.49
3	0.22	0.31	0.49
4	0.22	1.31	0.15
5	0.22	0.31	0.15
6	0.51	1.81	0.49

The two cells with a ‘blue’ eye color were assigned very different CIG values. In row 0 the CIG is 1.81, whereas in row 4 the CIG is 0.15. To understand the difference, you have to appreciate the different cohorts that fed into the CIG computation. The cohort (gender=male, name=Anton) that forms the posterior in row 0 is of size 1, whereas the cohort for row 4 (gender=male, name=Emil) is of size 2. The eye_color distribution of the second cohort is similar to the prior, thus the CIG value is low.

Or, looking at it from an attacker’s perspective: knowing that the target is male and named Anton, the attacker learns his eye color with certainty. In contrast, if the target is male and named Emil, the attacker is left with a 50/50 chance for blue and green. This is close to the attacker’s prior belief of 29/71.

Row 1 also deserves attention. Why is its CIG different from that of row 0, given that here too the cohort size is 1? This is explained by the prior belief. The attacker believes that 71% of the people in the dataset have green eyes. Thus, learning that Bill has green eyes is less of a surprise than learning that Anton has blue eyes.
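The three eye_color values discussed above can be checked by hand. The arithmetic below assumes base-2 logarithms and the divergence of the posterior relative to the prior, which matches the values in the table.

```python
import math

# Prior for eye_color in the toy dataset: 2 of 7 blue, 5 of 7 green.
prior = {'blue': 2 / 7, 'green': 5 / 7}

# Row 0: cohort (male, Anton) contains a single person with blue eyes.
cig_anton = 1.0 * math.log2(1.0 / prior['blue'])

# Row 1: cohort (male, Bill) contains a single person with green eyes.
cig_bill = 1.0 * math.log2(1.0 / prior['green'])

# Row 4: cohort (male, Emil) is split 50/50 between blue and green.
cig_emil = (0.5 * math.log2(0.5 / prior['blue'])
            + 0.5 * math.log2(0.5 / prior['green']))

print(round(cig_anton, 2), round(cig_bill, 2), round(cig_emil, 2))
# 1.81 0.49 0.15
```

For a cohort of size 1 the CIG reduces to the negative base-2 logarithm of the value's prior probability, so rarer values yield higher CIG.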

Going bigger - hackathon dataset¶

[18]:

hack_features = ['gender', 'AGE', 'POSTCODE', 'blood_group', 'eye_color', 'job']
hack_data = pd.read_csv('data/hackathon.csv')[hack_features]
hack_data = hack_data.fillna('Unemployed')
hack_data.head()

[18]:

	gender	AGE	POSTCODE	blood_group	eye_color	job
0	F	99	2649	B-	Brown	Psychologist, counselling
1	M	108	1780	A-	Hazel	Personnel officer
2	M	59	2940	B+	Hazel	Tourism officer
3	M	58	2945	B+	Blue	Make
4	M	30	2729	AB-	Brown	Forest/woodland manager

We now compute the CIG values and display them as a heatmap, with colors ranging from green for ‘safe’ values to red for the most at-risk values.

[19]:

hack_cig = piflib.compute_cigs(hack_data)
color_map = matplotlib.colors.ListedColormap(
        sns.color_palette("RdYlGn", 256).as_hex()[::-1])
sns.heatmap(hack_cig, cmap=color_map)

[19]:

<AxesSubplot:>

../_images/tutorials_tutorial_cig_15_1.png

Looking at the distribution of CIG values, we can see that they are quite substantial, in particular for POSTCODE and job.

[20]:

hack_cig.describe()

[20]:

	gender	AGE	POSTCODE	blood_group	eye_color	job
count	38462.000000	38462.000000	38462.000000	38462.000000	38462.000000	38462.000000
mean	0.996891	6.827838	10.058929	2.987481	2.313486	8.306507
std	0.068946	0.223181	2.583055	0.121209	0.093568	2.667701
min	0.000887	4.029292	4.754066	0.667983	0.730567	2.606579
25%	0.950303	6.788203	8.572935	2.979960	2.311166	9.300409
50%	0.950303	6.860459	8.821755	2.996329	2.324443	9.530706
75%	1.051470	6.904717	11.909218	3.011675	2.330845	9.707584
max	1.051470	11.909218	15.231146	3.023827	2.332356	10.707584

Let’s try to reduce the CIG values by removing the features ‘job’ and ‘POSTCODE’. Removing features will lead to larger cohort sizes for the posterior distributions. Alternatively, you could also

[21]:

cols = ['gender', 'AGE', 'blood_group', 'eye_color']
sub_hack_cig = piflib.compute_cigs(hack_data[cols])
sns.heatmap(sub_hack_cig, cmap=color_map, vmax=15)

[21]:

<AxesSubplot:>

../_images/tutorials_tutorial_cig_19_1.png
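The effect of dropping a feature on cohort sizes can be illustrated on the toy dataset from the beginning of this tutorial. The sketch below looks at the cohorts used for eye_color, first with gender and name known, then with name removed. The helper `cohort_sizes` is hypothetical and only serves this illustration.

```python
from collections import Counter

genders = ['male'] * 6 + ['female']
names = ['Anton', 'Bill', 'Charlie', 'Don', 'Emil', 'Emil', 'Charlie']
records = [dict(gender=g, name=n) for g, n in zip(genders, names)]


def cohort_sizes(records, known_features):
    """Size of the cohort each record falls into, given the known features."""
    keys = [tuple(r[f] for f in known_features) for r in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


with_name = cohort_sizes(records, ['gender', 'name'])
without_name = cohort_sizes(records, ['gender'])
print(with_name)     # [1, 1, 1, 1, 2, 2, 1]
print(without_name)  # [6, 6, 6, 6, 6, 6, 1]
```

Most cohorts grow, which pulls their posterior closer to the prior. The last record stays in a cohort of size 1, which shows that removing features does not help every row.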

[22]:

sub_hack_cig.describe()

[22]:

	gender	AGE	blood_group	eye_color
count	38462.000000	38462.000000	38462.000000	38462.000000
mean	0.094231	0.181759	0.164312	0.147753
std	0.157085	0.028178	0.106338	0.122034
min	0.000114	0.129416	0.015208	0.000030
25%	0.008432	0.161430	0.094626	0.063763
50%	0.040053	0.180967	0.142773	0.121803
75%	0.099452	0.197749	0.211684	0.190624
max	1.051470	0.253509	3.023827	2.332356

This already looks a lot better. The mean CIG values are much lower now, all below 0.2. However, the maximum values for gender, blood_group and eye_color are unchanged, so there are still some rows in the dataset with high CIG values. These rows still stand out and thus have a higher risk of re-identification.
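One simple way to locate such rows is to flag every row whose largest CIG exceeds a cut-off. The sketch below uses the rounded toy CIG table from earlier in this tutorial. The threshold of 1.0 is an arbitrary assumption for illustration, not a recommended value.

```python
# Rounded CIG values of the toy dataset, one dict per row.
toy_cigs = [
    {'gender': 0.22, 'name': 1.31, 'eye_color': 1.81},
    {'gender': 0.22, 'name': 0.31, 'eye_color': 0.49},
    {'gender': 0.51, 'name': 0.31, 'eye_color': 0.49},
    {'gender': 0.22, 'name': 0.31, 'eye_color': 0.49},
    {'gender': 0.22, 'name': 1.31, 'eye_color': 0.15},
    {'gender': 0.22, 'name': 0.31, 'eye_color': 0.15},
    {'gender': 0.51, 'name': 1.81, 'eye_color': 0.49},
]

THRESHOLD = 1.0  # hypothetical cut-off, chosen only for this example

# Collect (row index, feature with the highest CIG) for every row above the cut-off.
flagged = [
    (index, max(row, key=row.get))
    for index, row in enumerate(toy_cigs)
    if max(row.values()) > THRESHOLD
]
print(flagged)
# [(0, 'eye_color'), (4, 'name'), (6, 'name')]
```

On a pandas DataFrame of CIG values, the same filter could be expressed with a row-wise maximum (`DataFrame.max(axis=1)`) and a boolean mask.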