EDBT/ICDT 2009 Proceedings

Tight results for clustering and summarizing data streams

Authors

Sudipto Guha (University Of Pennsylvania, USA)

Abstract

In this paper we investigate algorithms and lower bounds for summarization problems over a single pass data stream. In particular we focus on histogram construction and K-center clustering. We provide a simple framework that improves upon all previous algorithms on these problems in either the space bound, the approximation factor or running time. The framework uses a notion of “streamstrapping” where summaries created for the initial prefixes of the data are used as a proxy input to simulate replaying the algorithm from the start. We also prove the first non-trivial lower bounds for these problems and show that the stricter requirement that if an algorithm approximates the error of every bucket or every cluster produced by it, sufficiently well (this is true of all known upper bounds on these problems) would imply that these upper bounds are almost the best possible.

Session

ICDT Research Session 8: Streams, Data Mining, Complexity (Wednesday, March 25, 14:00—15:30)

EDBT/ICDT 2009 Joint Conference

Electronic Conference Proceedings

Tight results for clustering and summarizing data streams

Authors

Abstract

Session