EDBT/ICDT 2009 Proceedings

Flexible and Efficient Querying and Ranking on Hyperlinked Data Sources

Authors

Ramakrishna Varadarajan (Florida International University, USA)
Hector Rodriguez-Drumond (Department of Computer Science, Universidad Simon Bolivar, Caracas, Venezuela)
Vagelis Hristidis (Florida International University, USA)
Louiqa Raschid (Department of Computer Science, University of Maryland, College Park, MD 20742, USA)
Maria-Esther Vidal (Department of Computer Science, Universidad Simon Bolivar, Caracas, Venezuela)
Luis Daniel Ibáñez (Department of Computer Science, Universidad Simon Bolivar, Caracas, Venezuela)

Abstract

There has been an explosion of hyperlinked data in many domains, e.g., the biological Web. Expressive query languages and effective ranking techniques are required to convert this data into browsable knowledge. We propose the Graph Information Discovery (GID) framework to support sophisticated user queries over a rich web of annotated and hyperlinked data entries, where query answers need to be ranked according to customized criteria, e.g., PageRank or ObjectRank. GID has a data model that includes a schema graph and a data graph, and an intuitive query interface. The GID framework allows users to easily formulate queries consisting of sequences of hard filters (selection predicates) and soft filters (ranking criteria); it can also be combined with other specialized graph query languages to enhance their ranking capabilities. GID queries have well-defined semantics and are implemented by a set of physical operators, each of which produces a ranked result graph. We discuss rewriting opportunities that provide an efficient evaluation of GID queries. Soft filters are a key feature of GID and are implemented using authority flow ranking techniques; these rankings are query-dependent and expensive to compute at runtime. We present approximate optimization techniques for GID soft filter queries based on the properties of random walks, using novel path-length-bound and graph-sampling approximation techniques. We experimentally validate our optimization techniques on large biological and bibliographic datasets. Our techniques can produce high-quality top-k answers with up to an order of magnitude savings in evaluation time compared with the exact solutions.
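To make the hard filter / soft filter distinction concrete, the following is a minimal Python sketch, not the GID operators themselves. It assumes a hard filter is a plain selection predicate over node attributes and a soft filter is a personalized random-walk score in the style of PageRank or ObjectRank, truncated to walks of at most `max_path_len` edges as a simple stand-in for the path-length bound mentioned in the abstract. The function names, the edge-list representation, and all parameters are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def hard_filter(nodes, predicate):
    """Hard filter (assumed form): keep node ids whose attributes satisfy the predicate."""
    return {node for node, attrs in nodes.items() if predicate(attrs)}


def soft_filter(edges, base_set, damping=0.85, max_path_len=10):
    """Soft filter (assumed form): score nodes by authority flowing from base_set.

    A random walk starts uniformly in base_set, follows an outgoing edge with
    probability `damping`, and stops otherwise. A node's score is the probability
    that the walk stops there within max_path_len steps. Walks that reach a node
    without outgoing edges simply end there.
    """
    if not base_set:
        return {}
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    # frontier[n]: probability that the walk is at n after the current number of steps
    frontier = {node: 1.0 / len(base_set) for node in base_set}
    scores = defaultdict(float)
    for _ in range(max_path_len + 1):
        next_frontier = defaultdict(float)
        for node, mass in frontier.items():
            scores[node] += (1.0 - damping) * mass
            targets = out.get(node, [])
            for target in targets:
                next_frontier[target] += damping * mass / len(targets)
        frontier = next_frontier
    return dict(scores)


def top_k(scores, k):
    """Return the k highest-scoring node ids, best first (ties broken by node id)."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Usage: select gene entries with a hard filter, then rank linked entries from them.
nodes = {"a": {"type": "gene"}, "b": {"type": "paper"}, "c": {"type": "paper"}}
edges = [("a", "b"), ("b", "c"), ("a", "c")]
base = hard_filter(nodes, lambda attrs: attrs["type"] == "gene")
ranking = top_k(soft_filter(edges, base, damping=0.5, max_path_len=2), 3)
```

Lowering `max_path_len` trades ranking accuracy for less work per query, which is the general kind of trade-off the approximation techniques above target; the exact bounds and the graph-sampling technique used in the paper are not reproduced here.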

Session

EDBT Research Session 16: Heterogeneous & Distributed (Thursday, March 26, 9:00–10:30)
