Cross-modal Approximate Nearest Neighbor Search (ANNS) is crucial for applications like search engines and recommendation systems. However, existing vector search indexes often struggle with poor search efficiency or slow index construction when handling cross-modal ANNS queries. To overcome these challenges, we introduce ParaGraph, an algorithm-hardware co-designed index that bridges search and construction efficiency for cross-modal ANNS. ParaGraph employs fast multi-round top-m projection and batched search-and-refine techniques for index construction. Additionally, it leverages modern heterogeneous hardware architectures by distributing computations across GPU and CPU, along with in-depth optimizations to maximize performance. Compared to the state-of-the-art cross-modal ANNS index, ParaGraph achieves a 4.1X to 4.9X speedup in index construction and a 50% smaller index size without compromising search efficiency.
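A minimal CUDA sketch of the batched neighbor-selection step that GPU graph construction of this kind relies on: each thread scans the candidate list of one base point and keeps a small top-m result in registers. The flat array layouts, the fixed m, and all names here are illustrative assumptions, not ParaGraph's actual code.

#include <cuda_runtime.h>
#include <float.h>

constexpr int M = 8;  // neighbors kept per point (assumed)

__global__ void batched_top_m(const float* base,  // [n, dim] base vectors
                              const int*   cand,  // [n, c] candidate ids per point
                              int n, int c, int dim,
                              int* out_nbrs)      // [n, M] selected neighbor ids
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float best_d[M];
    int   best_i[M];
    for (int k = 0; k < M; ++k) { best_d[k] = FLT_MAX; best_i[k] = -1; }

    for (int j = 0; j < c; ++j) {
        int id = cand[i * c + j];
        float d = 0.f;
        for (int t = 0; t < dim; ++t) {           // squared L2 distance
            float diff = base[i * dim + t] - base[id * dim + t];
            d += diff * diff;
        }
        for (int k = 0; k < M; ++k) {             // insert into sorted top-m list
            if (d < best_d[k]) {
                for (int s = M - 1; s > k; --s) {
                    best_d[s] = best_d[s - 1]; best_i[s] = best_i[s - 1];
                }
                best_d[k] = d; best_i[k] = id;
                break;
            }
        }
    }
    for (int k = 0; k < M; ++k) out_nbrs[i * M + k] = best_i[k];
}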
@inproceedings{damon25paragraph,author={Yang, Yuxiang and Chen, Shiwen and Deng, Yangshen and Tang, Bo},title={ParaGraph: Accelerating Graph Indexing through GPU-CPU Parallel Processing for Efficient Cross-modal ANNS},year={2025},publisher={Association for Computing Machinery},address={New York, NY, USA},booktitle={Proceedings of the 21st International Workshop on Data Management on New Hardware},series={DaMoN '25}}
SIGMOD 2025
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, and 12 more authors
In Companion of the 2025 International Conference on Management of Data, 2025
AlayaDB is a cutting-edge vector database system natively architected for efficient and effective long-context inference for Large Language Models (LLMs) at AlayaDB.AI. Specifically, it decouples the KV cache and attention computation from the LLM inference systems, and encapsulates them into a novel vector database system.
For Model-as-a-Service (MaaS) providers, AlayaDB consumes fewer hardware resources and offers higher generation quality for various workloads with different kinds of Service Level Objectives (SLOs), when compared with existing alternative solutions (e.g., KV cache disaggregation, retrieval-based sparse attention). The crux of AlayaDB is that it abstracts the attention computation and cache management for LLM inference into a query processing procedure, and optimizes the performance via a native query optimizer. In this work, we demonstrate the effectiveness of AlayaDB via (i) three use cases from our industry partners, and (ii) extensive experimental results on LLM inference benchmarks.
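To make the abstraction concrete, here is a minimal CUDA sketch of the underlying idea of treating sparse attention as a top-k vector query over the KV cache: score every cached key against the current query vector on the GPU, then attend only to the k best. The kernel name, the flat [n, dim] layout, and the host-side top-k step are illustrative assumptions, not AlayaDB's API.

__global__ void score_keys(const float* q,     // [dim] current query vector
                           const float* keys,  // [n, dim] cached key vectors
                           int n, int dim,
                           float* scores)      // [n] inner-product scores
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.f;
    for (int t = 0; t < dim; ++t) s += q[t] * keys[j * dim + t];
    scores[j] = s;
}

// A host-side top-k over `scores` (e.g., via thrust::sort_by_key) then picks
// which cached values participate in attention, standing in for the vector
// index lookup that a native query optimizer would plan.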
Improving the performance of GPU query processing is a well-studied problem in the database community. However, its performance is still unsatisfactory due to the low utilization of GPU memory bandwidth. In the literature, employing software prefetching techniques to improve bandwidth utilization is a common practice in CPU databases, as it overlaps computation cost and memory access latency. However, it has been ignored by GPU databases even though software prefetching ability is provided by modern GPU architectures (i.e., from NVIDIA Ampere onward). To investigate the effectiveness of software prefetching techniques on GPU query processing, we implement four software prefetching algorithms on GPU in this work, i.e., Group Prefetch (GP), Software-Pipelined Prefetch (SPP), Asynchronous Memory Access Chaining (AMAC), and Interleaved Multi-Vectorizing (IMV). We then adapt them to hash join probe and BTree search tasks with a suite of optimizations. Last, we conduct comprehensive experiments and evaluate their performance. The results confirm the superiority of software prefetching techniques for GPU query processing. Specifically, they achieve up to 1.19X speedup on hash join probe and 1.31X speedup on BTree search when compared with implementations without software prefetching.
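As a concrete illustration, the following CUDA sketch applies the Group Prefetch (GP) idea to a hash probe using Ampere's asynchronous-copy intrinsics: each thread issues all of its group's bucket loads into shared memory before waiting once, so the global-memory latencies of the whole group overlap. The slot layout, group size, and hash function are assumptions for illustration; the paper's implementations are more elaborate.

#include <cuda_pipeline.h>

struct __align__(8) Slot { int key; int val; };  // one hash-table slot (assumed)

// Launch with dynamic shared memory of blockDim.x * G * sizeof(Slot) bytes.
__global__ void gp_probe(const Slot* table, unsigned mask,
                         const int* keys, int n, int* out)
{
    const int G = 4;                   // group size per thread (assumed)
    extern __shared__ Slot buf[];      // G staged slots per thread
    Slot* mine = &buf[threadIdx.x * G];
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * G;

    // Stage 1: issue all G async copies before touching any of them.
    for (int g = 0; g < G; ++g) {
        int idx = base + g;
        if (idx < n) {
            unsigned h = (unsigned)keys[idx] & mask;  // trivial hash (assumed)
            __pipeline_memcpy_async(&mine[g], &table[h], sizeof(Slot));
        }
    }
    __pipeline_commit();
    __pipeline_wait_prior(0);  // wait once for the whole group

    // Stage 2: probe from shared memory, latency already overlapped.
    for (int g = 0; g < G; ++g) {
        int idx = base + g;
        if (idx < n)
            out[idx] = (mine[g].key == keys[idx]) ? mine[g].val : -1;
    }
}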
@inproceedings{damon24gpuprefetch,author={Deng, Yangshen and Chen, Shiwen and Hong, Zhaoyang and Tang, Bo},title={How Does Software Prefetching Work on GPU Query Processing?},year={2024},isbn={9798400706677},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3662010.3663445},doi={10.1145/3662010.3663445},booktitle={Proceedings of the 20th International Workshop on Data Management on New Hardware},articleno={5},numpages={9},location={Santiago, Chile},series={DaMoN '24}}
Merkle Patricia Trie (MPT) is a type of trie structure that offers efficient lookup and insert operations for immutable data systems that require multi-version access and tamper-evident controls, such as blockchains and verifiable databases. The performance of these systems depends critically on the throughput of the underlying MPT index structure. In this paper, we present a novel approach to accelerating MPT by leveraging the massive parallelism of GPU. However, achieving this is challenging because (i) lock-free data structures are difficult to implement and (ii) traditional fine-grained locking does not scale on GPU.
To address these challenges, we first analyze the technical difficulties of accelerating MPT via GPU, including the node splitting conflicts and hash computing conflicts caused by parallel insert operations. We then propose a lock-free algorithm, PhaseNU, and a lock-based algorithm, LockNU, on GPU to resolve node splitting conflicts, and devise a decision model that lets users choose the proper one for different workloads. We next propose a GPU-based hash computation algorithm, PhaseHC, to avoid hash computing conflicts. Last, we demonstrate the effectiveness of our proposed techniques by: (i) integrating them into both the real-world blockchain system Geth and the verifiable database LedgerDB, and demonstrating their superiority on the corresponding workloads; and (ii) conducting extensive experimental studies on two real-world datasets and one synthetic dataset. Our proposed solutions significantly outperform the deployed MPT solution in Geth on all datasets.
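For intuition, here is a minimal CUDA sketch in the spirit of the lock-based approach (not LockNU's actual code): before performing a split, each insert must win a per-node lock taken with atomicCAS, so conflicting splits on the same node are serialized. The node_locks and split_node arrays, and the retry idiom, are illustrative assumptions.

__global__ void insert_with_locks(int* node_locks,        // one lock per trie node, 0 = free
                                  const int* split_node,  // node each insert splits (precomputed, elided)
                                  int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int node = split_node[i];
    bool done = false;
    while (!done) {  // retry loop keeps the critical section inside one branch,
                     // the usual way to avoid warp-divergence deadlocks on GPU
        if (atomicCAS(&node_locks[node], 0, 1) == 0) {
            // ... perform the node split exclusively (elided) ...
            __threadfence();                   // publish the split before releasing
            atomicExch(&node_locks[node], 0);  // release the lock
            done = true;
        }
    }
}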
@article{vldb24mpt,author={Deng, Yangshen and Yan, Muxi and Tang, Bo},title={Accelerating Merkle Patricia Trie with GPU},year={2024},publisher={VLDB Endowment},volume={17},number={8},issn={2150-8097},url={https://www.vldb.org/pvldb/vol17/p1856-tang.pdf},doi={10.14778/3659437.3659443},journal={Proc. VLDB Endow.},pages={1856-1869}}
As a popular distributed data warehouse system, Apache Hive has been widely used for big data analytics in many organizations. Meanwhile, exploiting the massive parallelism of GPU to accelerate online analytical processing (OLAP) has been extensively explored in the database community. In this paper, we present GHive, which enhances CPU-based Hive via CPU-GPU heterogeneous computing. GHive is designed for business intelligence applications and provides the same API as Hive for compatibility. To run SQL queries jointly on both CPU and GPU, GHive comes with three key techniques: (i) a novel data model gTable, which is column-based and enables efficient data movement between CPU memory and GPU memory; (ii) a GPU-based operator library Panda, which provides a complete set of SQL operators with extensively optimized GPU implementations; (iii) a hardware-aware MapReduce job placement scheme, which puts jobs judiciously on either GPU or CPU via a cost-based approach. In the experiments, we observe that GHive outperforms Hive in both query processing speed and operating expense on the Star Schema Benchmark (SSB).
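A minimal host-side sketch of what a cost-based placement rule can look like (the cost model and every constant below are illustrative assumptions, not GHive's actual scheduler): a job goes to the GPU only when its estimated kernel time plus PCIe transfer time beats the CPU estimate.

struct JobStats { double rows; double bytes_in; double bytes_out; };

enum class Device { CPU, GPU };

Device place(const JobStats& j) {
    const double cpu_rate = 50e6;   // rows/s on CPU (assumed)
    const double gpu_rate = 800e6;  // rows/s on GPU (assumed)
    const double pcie_bw  = 12e9;   // bytes/s over PCIe (assumed)
    double cpu_cost = j.rows / cpu_rate;
    double gpu_cost = j.rows / gpu_rate
                    + (j.bytes_in + j.bytes_out) / pcie_bw;  // data movement dominates small jobs
    return gpu_cost < cpu_cost ? Device::GPU : Device::CPU;
}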
@inproceedings{socc22ghive,author={Liu, Haotian and Tang, Bo and Zhang, Jiashu and Deng, Yangshen and Yan, Xiao and Zheng, Xinying and Shen, Qiaomu and Zeng, Dan and Mao, Zunyao and Zhang, Chaozu and You, Zhengxin and Wang, Zhihao and Jiang, Runzhe and Wang, Fang and Yiu, Man Lung and Li, Huan and Han, Mingji and Li, Qian and Luo, Zhenghai},title={GHive: Accelerating Analytical Query Processing in Apache Hive via CPU-GPU Heterogeneous Computing},year={2022},isbn={9781450394147},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3542929.3563503},doi={10.1145/3542929.3563503},booktitle={Proceedings of the 13th Symposium on Cloud Computing},pages={158–172},numpages={15},location={San Francisco, California},series={SoCC '22}}
As a distributed, fault-tolerant data warehouse system for large-scale data analytics, Apache Hive has been used for various applications in many organizations (e.g., Facebook, Amazon, and Huawei). Meanwhile, it is a common practice in the industry to exploit the large degrees of parallelism of GPU to improve the performance of online analytical processing (OLAP) in database systems. This demo presents GHive, which enables Apache Hive to accelerate OLAP queries by jointly utilizing CPU and GPU in intelligent and efficient ways. The takeaways for SIGMOD attendees include: (1) the superior performance of GHive compared with vanilla Hive, which only uses CPU; (2) intuitive visualizations of execution statistics for Hive and GHive to understand where GHive's acceleration comes from; (3) detailed profiling of the time taken by each operator on CPU and GPU to show the advantages of GPU execution.
@inproceedings{sigmod22ghive,author={Liu, Haotian and Tang, Bo and Zhang, Jiashu and Deng, Yangshen and Zheng, Xinying and Shen, Qiaomu and Yan, Xiao and Zeng, Dan and Mao, Zunyao and Zhang, Chaozu and You, Zhengxin and Wang, Zhihao and Jiang, Runzhe and Wang, Fang and Yiu, Man Lung and Li, Huan and Han, Mingji and Li, Qian and Luo, Zhenghai},title={GHive: A Demonstration of GPU-Accelerated Query Processing in Apache Hive},year={2022},isbn={9781450392495},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3514221.3520166},doi={10.1145/3514221.3520166},booktitle={Proceedings of the 2022 International Conference on Management of Data},pages={2417–2420},numpages={4},keywords={OLAP, CPU-GPU co-processing, Apache Hive},location={Philadelphia, PA, USA},series={SIGMOD '22}}