Algorithms for large data sets

Date: 2015-10-07.

Processing large geometric data sets

How can we handle large data sets? Powerful computers with large memories suffice for those who have them (and who are often responsible for producing the data sets in the first place). To make large data sets useful to the wider audience that has only commodity processors, however, we need algorithms that use a small amount of memory wisely. Here we review general approaches to algorithms for large geometric data sets, and the literature on computing large Delaunay triangulations.

Several types of algorithms are used to process large geometric data sets: divide-and-conquer algorithms, which cut a problem into small subproblems that can be solved independently; cache-efficient algorithms, which cooperate with the hardware's memory hierarchy (caches and virtual memory); external memory algorithms, which exercise explicit control over where, when, and how data structures are stored on disk (rather than trusting the virtual memory); and streaming algorithms, which read a stream of data sequentially (usually once, perhaps in several passes) and retain only a small portion of the information in memory. All of these algorithms try to exploit or create spatial coherence.

Divide-and-conquer algorithms for huge data sets are, for some problems, difficult to design: they often require ingenious algorithms to choose the cuts, necessitate tedious programming to communicate across the cuts, or suffer from poor-quality results near the cuts. For Delaunay triangulations, the very act of choosing cuts so that no further communication is needed requires a convex hull algorithm that can itself process huge data sets [Blelloch et al. 1999].

Cache-efficient algorithms (which often also cooperate well with virtual memory) fall into two categories.
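To make "creating spatial coherence" concrete, here is a minimal sketch (not from the paper; the function names are illustrative) of one common technique: sorting points along a Z-order (Morton) space-filling curve, so that points near each other in space tend to be near each other in the ordering, which benefits divide-and-conquer cuts, cache behavior, and streaming alike.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x contributes even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y contributes odd bit positions
    return code

def morton_sort(points, bits: int = 16):
    """Sort 2D points in [0,1)^2 by Morton code: spatially close
    points tend to end up close together in the ordering."""
    scale = (1 << bits) - 1
    def key(p):
        x, y = p
        return interleave_bits(int(x * scale), int(y * scale), bits)
    return sorted(points, key=key)

# Two clusters of points; after sorting, each cluster is contiguous.
pts = [(0.9, 0.9), (0.1, 0.1), (0.12, 0.08), (0.85, 0.92)]
print(morton_sort(pts))
```

A production implementation would use bit tricks or lookup tables rather than a per-bit loop, but the ordering produced is the same.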
Some software is optimized for a particular cache architecture; a well-known example is BLAS (the Basic Linear Algebra Subprograms), optimized by most microprocessor vendors for their own architectures. Other software is cache-oblivious, designed to cooperate well with any cache or virtual memory, regardless of the details of its architecture. This category includes heuristics for cache-oblivious data layouts that do well in practice [Yoon et al. 2005], and cache-oblivious algorithms that offer guaranteed bounds on the amount of traffic between cache and memory [Kumar 2003] (and sometimes also do well in practice).

External memory algorithms use disks for temporary storage of data structures that do not fit in memory, and explicitly control data movement and data layout on disk with the goal of minimizing the number of disk accesses [Vitter 2001]. Like cache-oblivious algorithms, external memory algorithms have received a lot of attention from theoreticians, who give provable bounds on the number of disk accesses their algorithms perform. Most of these algorithms build sophisticated data structures on disk, notably B-trees.

Streaming, the approach we advocate here, differs from external memory algorithms in that nothing is temporarily stored in external memory; the disk is used only for input and output. A streaming algorithm makes one or a few sequential passes over the data and restricts computations to the parts that are in memory. It can neither backtrack nor store more than a small fraction of the stream, so it needs a mechanism to decide which parts of the data to retain and for how long, and where it can safely complete computations. Some online algorithms can be remarkably more effective if the stream includes a small amount of information about the "future" of the stream [Karp 1992]. Representations for streaming meshes [Isenburg and Gumhold 2003; Isenburg et al. 2003; Isenburg and Lindstrom 2005] contain not only points, triangles, and tetrahedra, but also finalization tags that certify when a topological entity is seen for the last time. Finalization tags permit a geometric algorithm to output partial results and discard the associated information, freeing room in memory for more data to stream in.
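The role of finalization tags can be sketched as follows. This is a toy consumer of a hypothetical record layout ('v' = vertex, 'f' = face, 'x' = vertex finalized); the actual streaming mesh formats of Isenburg et al. differ in detail, but the memory-reclamation idea is the same.

```python
def process_stream(records):
    """Single pass over a mesh stream: keep only unfinalized vertices
    in memory, emit each face as soon as it arrives."""
    live_vertices = {}   # vertex id -> coordinates, kept only while needed
    output = []          # stands in for writing partial results downstream
    peak = 0             # largest number of vertices ever resident at once

    for rec in records:
        tag = rec[0]
        if tag == 'v':                        # ('v', id, x, y): new vertex
            _, vid, x, y = rec
            live_vertices[vid] = (x, y)
            peak = max(peak, len(live_vertices))
        elif tag == 'f':                      # ('f', a, b, c): face over live vertices
            coords = tuple(live_vertices[v] for v in rec[1:])
            output.append(coords)             # partial result, written immediately
        elif tag == 'x':                      # ('x', id): certified last reference
            del live_vertices[rec[1]]         # reclaim memory for incoming data
    return output, peak

stream = [
    ('v', 0, 0.0, 0.0), ('v', 1, 1.0, 0.0), ('v', 2, 0.0, 1.0),
    ('f', 0, 1, 2),
    ('x', 0),                                 # vertex 0 finalized: safe to discard
    ('v', 3, 1.0, 1.0),
    ('f', 1, 3, 2),
    ('x', 1), ('x', 2), ('x', 3),
]
faces, peak = process_stream(stream)
print(len(faces), peak)
```

Here four vertices pass through the stream, but because vertex 0 is finalized before vertex 3 arrives, at most three vertices are ever resident; without the finalization tag, the consumer could never know when discarding a vertex is safe.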