Abstract
Similarity join is an important operation in data mining, with a diverse range of real-world applications. This thesis proposes three efficient MapReduce algorithms for performing similarity joins between multisets. Filtering techniques for similarity joins minimize the number of pairs of entities joined and are therefore vital for improving the efficiency of the algorithm. Multisets represent real-world data better than sets because they account for the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either incorporate no filtering technique or incorporate prefix filtering inefficiently, with poor scalability. This work extends the prefix, size, positional and suffix filters to multisets, and also achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequen