Naga K. Govindaraju, Nikunj Raghuvanshi, and Dinesh Manocha
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processor units (GPUs). We exploit the high computational power and memory bandwidth of graphics processors and present a novel sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and the construction of epsilon-approximate quantile and frequency summaries. Our overall algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed or variable-sized sliding windows and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We have implemented our algorithms on a PC with a NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium 4 CPU and applied them to large data streams consisting of more than 100 million values. We have compared their performance against optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processor available on a commodity computer system is an efficient stream-processor and a useful co-processor for mining data streams.
Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors in Proceedings of ACM SIGMOD 2005.
CB #3175, Department of Computer Science
University of North Carolina
Chapel Hill, NC 27599-3175
(919) 962-1749
geom@cs.unc.edu