Abstract:
Frequent itemset mining is a branch of data mining that extracts frequently occurring itemsets from a
dataset, and the mined itemsets may reveal sensitive patterns. Privacy-Preserving Data Mining (PPDM)
approaches hide such sensitive information, but they also reduce the utility of the dataset.
Heuristic-based PPDM approaches sanitize the dataset by removing sensitive patterns from the
transactions that contain them, choosing what to remove according to a heuristic. They are simple
and require less computational time than border-based and exact approaches, so researchers have
given much attention to finding better heuristics that preserve as much of the data's utility as
possible. In this work, we propose two heuristic-based approaches: Removal of
Closed Sensitive Itemsets with Maximum Support (MaxRCSI) and Removal of Closed Sensitive
Itemsets with Minimum Support (MinRCSI). In both approaches, the sensitive itemsets are first
reduced to closed sensitive itemsets, and the sanitization process is then carried out on this
reduced set. Experiments on real datasets as well as on benchmark data show that the proposed
approaches produce sanitized data with substantially better utility than the existing
approaches. However, these sequential approaches cannot cope with
massive amounts of data. The other two proposed approaches, Parallelized Removal of Closed
Patterns with Minimum Support (MinPRCP) and Parallelized Removal of Closed Patterns with
Maximum Support (MaxPRCP), are parallel implementations of MinRCSI and MaxRCSI on the
Spark parallel computing framework. These parallelized approaches are scalable enough to
handle large datasets. Experiments on benchmark datasets show that MinPRCP and
MaxPRCP scale better than MinRCSI, MaxRCSI, and other sequential approaches.
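
As a rough illustration of the general heuristic-based sanitization idea summarized above, the Python sketch below hides one sensitive itemset by deleting a single item from enough supporting transactions to push its support below the mining threshold. It is not MinRCSI or MaxRCSI. The victim-item rule (the item with the highest individual support) and the transaction order (shortest first) are assumptions made only for this example, and the names `support` and `hide_itemset` are hypothetical.

```python
from typing import Iterable, List, Set


def support(transactions: Iterable[Set[str]], itemset: Set[str]) -> int:
    """Absolute support: number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def hide_itemset(transactions: List[Set[str]], sensitive: Set[str],
                 min_support: int) -> List[Set[str]]:
    """Return a sanitized copy in which `sensitive` has support below `min_support`.

    Illustrative heuristic only (an assumption, not the method of the paper):
    - victim item: the item of `sensitive` with the highest individual support;
    - transactions are modified shortest-first.
    """
    sanitized = [set(t) for t in transactions]  # never mutate the input
    sensitive = set(sensitive)

    # Pick the victim item deterministically (sorted() breaks ties by name).
    victim = max(sorted(sensitive), key=lambda item: support(sanitized, {item}))

    # Indices of supporting transactions, shortest transactions first.
    supporting = sorted(
        (i for i, t in enumerate(sanitized) if sensitive <= t),
        key=lambda i: len(sanitized[i]),
    )

    # The itemset is hidden once its support is at most min_support - 1.
    excess = len(supporting) - (min_support - 1)
    for i in supporting[:max(excess, 0)]:
        sanitized[i].discard(victim)
    return sanitized


transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"b", "c"},
    {"a", "c"},
]
sanitized = hide_itemset(transactions, {"a", "b"}, min_support=2)
print(support(transactions, {"a", "b"}), support(sanitized, {"a", "b"}))
```

Every deleted item also lowers the support of non-sensitive itemsets that contain it, which is the utility loss the abstract refers to. The choice of victim item and transactions is what differs from one heuristic to another.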