Approximate Substring Selectivity Estimation
Authors
- Hongrae Lee (University of British Columbia, Canada)
- Raymond Ng (University of British Columbia, Canada)
- Kyuseok Shim (Seoul National University, Korea)
Abstract
We study the problem of estimating selectivity of approximate substring queries. Its importance in databases is ever increasing as more and more data are input by users and are integrated with many typographical errors and different spelling conventions. To begin with, we consider edit distance for the similarity between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms, MOF and LBS for the task. The latter extends the former with ideas from set hashing signatures. The experimental results show that MOF is a light-weight algorithm that gives fairly accurate estimations. However, if more space is available, LBS can give better accuracy than MOF and other baseline methods. Next, we extend the proposed solution to other similarity predicates, SQL LIKE operator and Jaccard similarity.
Session
EDBT Research Session 23: Information Retrieval (Thursday, March 26, 14:00—15:30)