Weighted Cell Information Gain
The CIG assumes that the unknown feature is independent of all other features. Features containing personal information, however, often have inter-dependencies, e.g. first names and gender, or postcode and income.
The CIG can therefore be seen as the worst-case scenario: none of the uncertainty in the unknown feature can be explained by the other features.
The Weighted Cell Information Gain explores the best-case scenario: we assume that all observed correlations are due to causal dependencies between the features.
Let $X$ be the unknown feature and let $Y$ denote all other features. Then $H(X)$, the entropy of feature $X$, describes the amount of information contained in that feature. The conditional entropy $H(X \mid Y)$ describes the amount of information contained in feature $X$, given that all other feature values are known (taking all possible correlations into account).
Dividing the conditional entropy by the entropy of the feature, we get the factor $\frac{H(X \mid Y)}{H(X)}$, which describes what fraction of the information in a feature cannot be explained by the correlations with all other features.
The wCIG is defined as the CIG value multiplied by the factor $\frac{H(X \mid Y)}{H(X)}$.
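The weighting factor can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function names `entropy`, `conditional_entropy` and `wcig_factor` are hypothetical, probabilities are estimated from empirical frequencies, entropies are measured in bits, and returning 0 for a constant feature is an assumed convention. The CIG value itself is taken as given and would simply be multiplied by the returned factor.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, others):
    """H(target | others), computed as H(target, others) - H(others)."""
    joint = list(zip(target, others))
    return entropy(joint) - entropy(others)


def wcig_factor(rows, feature):
    """Fraction of the information in column `feature` that is not
    explained by the remaining columns.

    `rows` is a list of equal-length tuples, `feature` a column index.
    """
    target = [row[feature] for row in rows]
    others = [row[:feature] + row[feature + 1:] for row in rows]
    h = entropy(target)
    if h == 0:
        # Assumed convention: a constant feature carries no information.
        return 0.0
    # Clamp at zero to absorb floating-point rounding in the subtraction.
    return max(0.0, conditional_entropy(target, others)) / h


# Two features that vary independently of each other.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
factor = wcig_factor(rows, feature=0)
```

In this small example the two columns are independent, so knowing the second column explains nothing about the first and the factor comes out as 1; the wCIG then equals the CIG.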
Caution
Correlation does not mean causation. A trivial counterexample is the following dataset:
A | B |
---|---|
a | b |
c | r |
f | e |
There is a perfect correlation between features A and B: each value of one determines the value of the other, so the conditional entropy is zero and all wCIG values are zero. However, there is no causal relationship between the two. In fact, the CIG values are quite high, as all values are unique.
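The weighting factor for this dataset can be worked out by hand. Writing $H(A)$ for the entropy of feature A and $H(A \mid B)$ for the conditional entropy of A given B, and assuming empirical frequencies with base-2 logarithms (the ratio does not depend on the base), each of the three values of A occurs with probability $\tfrac{1}{3}$ and is fully determined by the value of B:

$$
H(A) = -\sum_{i=1}^{3} \tfrac{1}{3} \log_2 \tfrac{1}{3} = \log_2 3 \approx 1.585 \text{ bits},
\qquad
H(A \mid B) = 0,
\qquad
\frac{H(A \mid B)}{H(A)} = 0
$$

By symmetry the same holds for feature B, so every CIG value is multiplied by zero, however large it is.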