Weighted Cell Information Gain¶

The CIG assumes that the unknown feature is independent from all other features. Features containing personal information often have inter-dependencies, e.g. firstnames and gender, or postcode and income.

One can look at the CIG as the worst-case scenario. Nothing of the uncertainty of the unknown feature can be explained by the other features.

The Weighted Cell Information Gain explores the best-case scenario: we assume that all observed correlations are due to causal dependencies between the features.

Let $X_j$ be the unknown feature. Then $H(X_j)$ , the entropy of feature $j$ , describes the amount of information contained in that feature. The conditional entropy $H\left(X_{j} \mid X_{1}, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{m}\right)$ describes the amount of information contained in feature $j$ , given that all other feature values are known (taking all possible correlations into account).

Dividing the conditional entropy by the entropy of the feature, we get a factor that describes what fraction of the information in a feature can not be explained by the correlations with all other features.

$w_j =\frac{H\left(X_{j} \mid X_{1}, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{m}\right)}{H(X_j)},$

The wCIG is defined as the CIG value multiplied by factor $w_j$ .

$wCIG(i,j) = w_j * CIG(i, j)$

Caution¶

Correlation does not mean causation. A trivial counterexample is the following dataset:

A	B
a	b
c	r
f	e

There is a perfect correlation between feature A and B, thus all wCIG values are zero. However, there is no causal relationship between the two. In fact, the CIG values are quite high, as all values are unique.