API Documentation

PIF Calculator

piflib.pif_calculator.apply_to_posterior_and_prior(dataset, feature_idx, prior_distributions, accuracies, fun)
piflib.pif_calculator.binom(n, r)

Return the binomial coefficient: n choose r.
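As a quick illustration of the quantity described above, the following sketch computes "n choose r" with the standard library's math.comb (Python 3.8 or later). The helper name n_choose_r is hypothetical, and this is not necessarily how binom is implemented.

```python
import math


def n_choose_r(n, r):
    """Number of ways to choose r items out of n (hypothetical helper)."""
    return math.comb(n, r)


print(n_choose_r(5, 2))  # 10
```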

piflib.pif_calculator.calculate_kl(p, q)

Calculate D_KL(P || Q) (the KL-divergence) in bits.

D_KL(P || Q) is the information gained when one revises one’s beliefs from the prior probability distribution Q to the posterior probability distribution P. (Wikipedia, Kullback–Leibler divergence)

p and q are both dictionaries mapping some hashable to a number. Both are assumed to be normalised: the values of each should add up to 1. q must not have any 0 values unless the corresponding p value is also 0.
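The definition D_KL(P || Q) = sum over x of P(x) * log2(P(x) / Q(x)) can be sketched for dictionary inputs as follows. This is a minimal illustration under the constraints stated above, not the library's own code; the function name kl_divergence_bits and the example distributions are assumptions made for this sketch.

```python
import math


def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two dicts mapping outcomes to probabilities.

    Hypothetical helper: assumes p and q are normalised.
    """
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0:
            continue  # terms with P(x) = 0 contribute nothing
        q_x = q.get(outcome, 0)
        if q_x == 0:
            raise ValueError(f"q is 0 for {outcome!r} but p is not")
        total += p_x * math.log2(p_x / q_x)
    return total


prior = {"a": 0.5, "b": 0.5}
posterior = {"a": 1.0, "b": 0.0}
print(kl_divergence_bits(posterior, prior))  # 1.0
```

Moving from a uniform prior over two outcomes to certainty about one of them yields exactly one bit of information.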

piflib.pif_calculator.calculate_prob_change(p, q)

Calculate the change in probability for each element of the posterior compared to the prior.
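The documentation does not state how the change is measured. The sketch below assumes it is the simple difference posterior minus prior for each element, and that the first argument is the posterior (by analogy with calculate_kl). Both points are assumptions, and the name prob_change is hypothetical.

```python
def prob_change(posterior, prior):
    """Per-element difference posterior - prior (assumed definition)."""
    return {
        outcome: prob - prior.get(outcome, 0.0)
        for outcome, prob in posterior.items()
    }


prior = {"a": 0.5, "b": 0.5}
posterior = {"a": 0.75, "b": 0.25}
change = prob_change(posterior, prior)
print(change)  # {'a': 0.25, 'b': -0.25}
```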

piflib.pif_calculator.compute_cigs(dataframe, feature_priors={}, feature_accuracies={}, samples=None)

Compute the cell information gain (CIG) for all cells in the dataset.

Find the risk (as KL divergence from prior) for all attributes.

Parameters
  • dataframe – a Pandas DataFrame object containing tabular data

  • feature_priors – optional. A dictionary mapping the feature index to an assumed prior. If not provided, the prior for the feature is calculated from the global distribution.

  • feature_accuracies – optional. A dictionary mapping the feature index to the accuracy of the feature. If not provided for a feature, it defaults to 1.

Returns

a Pandas DataFrame containing the CIG values. The CIG values are at the same index as their corresponding cell values in the input dataframe.

piflib.pif_calculator.compute_csfs(df, feature_priors={}, feature_accuracies={})

Compute the Cell Surprise Factor (CSF) for all cells in the dataset.

The CSF is defined as the change in probability for a cell value between the prior and the posterior distribution.

Parameters
  • df – a Pandas DataFrame object containing tabular data

  • feature_priors – optional. A dictionary mapping the feature index to an assumed prior. If not provided, the prior for the feature is calculated from the global distribution.

  • feature_accuracies – optional. A dictionary mapping the feature index to the accuracy of the feature. If not provided for a feature, it defaults to 1.

Returns

a Pandas DataFrame containing the CSF values. The CSF values are at the same index as their corresponding cell values in the input dataframe.

piflib.pif_calculator.compute_pif(cigs, percentile)

Compute the PIF.

The PIF is defined as the n-th percentile of the individual RIG values. In other words, the RIG of n percent of the entities in the dataset does not exceed the PIF value.

RIG stands for row information gain. It represents the overall information gain for an entity in the dataset. The RIG is computed by summing the CIG values of an entity.

The percentile value can be chosen between 0 and 100. A value of 100 returns the maximum RIG value. Often, the RIG values form a long-tailed distribution with a few high-value outliers. Choosing a percentile value lower than 100 will ignore (some of) the highest values. If ignoring the risk of some entities in the dataset fits within your risk framework, then specifying a percentile value of less than 100 will make the PIF value less susceptible to RIG outliers.

Parameters
  • cigs – The CIG values of the dataset (see the compute_cigs function in this module)

  • percentile – Which percentile of RIG values should be included in the PIF.

Returns

the PIF_percentile value of the given CIGs
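The relationship between CIG, RIG and PIF described above can be sketched in plain Python. This is an illustration only: it uses nested lists instead of a Pandas DataFrame, the CIG values are made up, and the percentile is computed with linear interpolation between the closest ranks, which is an assumption (the library may use a different percentile method). The function names are hypothetical.

```python
def row_information_gains(cigs):
    """Sum the CIG values of each row to get one RIG per entity."""
    return [sum(row) for row in cigs]


def percentile(values, pct):
    """pct-th percentile, linearly interpolated between closest ranks."""
    if not 0 <= pct <= 100:
        raise ValueError("percentile must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up CIG values: one inner list per entity, one value per feature.
cigs = [
    [0.5, 0.5],
    [1.0, 1.0],
    [1.5, 1.5],
    [2.0, 2.0],
    [8.0, 8.0],  # an outlier entity with very high information gain
]

rigs = row_information_gains(cigs)
print(rigs)                   # [1.0, 2.0, 3.0, 4.0, 16.0]
print(percentile(rigs, 100))  # 16.0
print(percentile(rigs, 75))   # 4.0
```

With a percentile of 100 the result is dominated by the single outlier row, while a percentile of 75 ignores it, which is the trade-off the description above refers to.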

piflib.pif_calculator.compute_posterior_distributions(feature, df)
piflib.pif_calculator.compute_weighted_cigs(dataframe, feature_priors={}, feature_accuracies={})

Compute the Weighted Cell Information Gain (wCIG) for all cells in the dataset.

Find the risk (as KL divergence from prior) for all attributes.

Parameters
  • dataframe – a Pandas DataFrame object containing tabular data

  • feature_priors – optional. A dictionary mapping the feature index to an assumed prior. If not provided, the prior for the feature is calculated from the global distribution.

  • feature_accuracies – optional. A dictionary mapping the feature index to the accuracy of the feature. If not provided for a feature, it defaults to 1.

Returns

a Pandas DataFrame containing the wCIG values. The wCIG values are at the same index as their corresponding cell values in the input dataframe.

piflib.pif_calculator.find_kls_for_features(dataset, feature_is, feature_distributions, accuracies)

Find the KL divergence of feature values against the prior.

We find the true distribution of the features taking into account the accuracy. We then compute the KL divergence.

piflib.pif_calculator.sample_is(n, r, samples)