External-Memory Streaming

Interaction with external-memory devices, such as hard disk drives (HDDs), is a significant challenge for modern applications due to the large delays incurred by random access. In this project, we develop and study a new model for external-memory applications called bowtie streaming. Our optimized I/O scheduling algorithms are able to maintain throughputs several times higher than the fastest prior methods.

Tuxedo

Tuxedo is a software suite containing the reference implementation of our bowtie streaming algorithms. It contains the following components:

  1. I/O scheduling algorithms capable of achieving theoretically-optimal (minimal) seek counts
  2. Dynamic algorithm for multi-pass bowtie optimization
  3. Hardware benchmarking tools for collecting single-pass and multi-pass rate tables
  4. The Block File System, a simulated filesystem that reduces native filesystem overhead during complex tasks
  5. The Hook System, a tool that can trace, process, and replay the I/Os of any Windows application

The software package also implements an efficient external-memory distribution sort, TuxedoSort, which has been tested on single-machine sorts up to 40TB using only 20GB of memory. On this challenging task, TuxedoSort was over 8x faster than the next-fastest system, finishing the sort in only 24 hours. Note that the RAID system on our test machine has 4 GB/s sequential throughput, meaning that the theoretical minimum time just to read and write 40TB of data is roughly six hours.

Tuxedo's source code can be requested via email.

Bucket Game (D1)

This is a simulation of the bucket game introduced in our paper on bowtie streaming. It demonstrates the distribute-from-file case, or D1 in the paper's notation.

Publications

On High-Latency Bowtie Data Streaming

Gabriel Stella, Dmitri Loguinov. IEEE BigData, December 2022.