Project Proposal: Parallel Limit Order Book Simulation

Irene Liu (irenel), Lillian Yu (lyu2)
15-418 – Spring 2026



URL

landing page: https://leeleeian.github.io/418-final/


Summary

We are going to build a parallel simulation of a limit order book (LOB) to study how different synchronization strategies impact scalability under realistic financial workloads. Our implementation will explore multi-core CPU parallelism using threading and synchronization primitives.


Background

A limit order book is the core data structure used by financial exchanges to match buy and sell orders. Orders are processed according to price-time priority, meaning that better-priced orders are matched first, and among equal prices, earlier orders take precedence.

To maintain this priority, the system typically maintains two ordered structures: a Bid book (buyers, sorted descending by price) and an Ask book (sellers, sorted ascending by price). Each of these structures is often implemented as a sparse list of “price levels,” where each price level contains a FIFO linked list of individual orders (to preserve time precedence).

The LOB supports three main operations: inserting new orders, cancelling existing orders, and an internal matching operation that fills orders whenever a bid and an ask cross in price.

Order books are used by exchanges that process millions of orders per second. The high volume of data, as well as its time sensitivity, makes parallelizing LOBs a worthwhile optimization target. Moreover, many parts of this system present opportunities for parallelism. For example, operations on different tickers do not affect one another, since they touch different order books, making them highly data-parallel. Additionally, parsing the messages associated with each order (placement or cancellation) could be pipelined to hide latency.

However, within each ticker, the matching engine introduces strong dependencies that limit parallelism. For example, a single but large market order can produce many matches across multiple price levels, creating a long read-modify-write chain that alters global state. Similarly, since exchanges are legally required to process messages in a deterministic order, it is essential that order inserts and cancellations access shared memory structures safely and in a synchronized manner. This makes the LOB a compelling case study for understanding the tradeoffs between concurrency and correctness.


The Challenge

The primary challenge lies in maintaining correctness while introducing parallelism.

Dependencies: The main dependency we must navigate is that matching bids and asks depends on a strict price-time priority of orders that must be maintained across a globally shared state. As such, it is important to maintain a deterministic ordering, and good synchronization of all the processors.

Constraints and System Mapping: Mapping this workload is particularly challenging because most trading activity occurs near the best bid/ask, leading to potential for heavy contention and/or uneven load balancing.

Memory Access Characteristics: The workload has irregular memory access patterns and poor spatial locality because the order book stores the orders within each price level through pointers, and threads must traverse this shared mutable structure for every LOB operation. Moreover, if the bid-ask spread shifts dramatically, we lose spatial locality even if we lay data out ordered by price (the intuitively natural choice).

Communication to Computation Ratio: The communication-to-computation ratio may be high, since the actual computation involves little more than comparing integer prices and subtracting order quantities, but substantial communication is required to safely update the shared mutable state with the results of that computation.

Contention and Synchronization Overhead: Since there are many small operations and a large number of concurrent agents targeting certain memory locations (those associated with the best bid/ask), a single lock can lead to severe contention. Fine-grained locking could be a good way to address this; however, it introduces additional synchronization overhead from lock acquisition and cache coherence traffic, particularly if there is false sharing from adjacent price levels sitting on the same cache line. As such, there is an opportunity to analyze the tradeoff between concurrency and synchronization overhead.

Overall, naive approaches such as coarse-grained locking severely limit scalability, while fine-grained locking and lock-free data structures introduce overhead from synchronization, retries, and cache coherence effects. Additionally, workload skew exacerbates contention at hot spots. As such, we aim to explore whether restructuring computation, e.g. through batching, agent-level parallelism, and/or partitioning, can reduce synchronization frequency and improve throughput.


Resources

We will implement the system in C++ using multithreading (std::thread or pthreads).

We will use:

We will build from scratch but reference:


Goals and Deliverables

Plan to Achieve

Hope to Achieve

High Reach Goals (with very low probability)

Evaluation Questions


Platform Choice

We choose C++ (with std::thread or OpenMP) as our implementation language because LOBs rely heavily on shared mutable state, branching, and low-latency pointer traversal. C++ provides fine-grained control over memory models, atomics, and synchronization primitives, which enables such optimizations on lock-free or fine-grained locking data structures.

We choose multi-core CPUs (the GHC lab machines) as our target hardware because order matching is highly data-dependent and may produce substantial divergent execution. In particular, one thread processing an order with an existing price match in the LOB may execute immediately, while another thread may have to insert its order into the book, which could require a pointer traversal. These drastically different execution paths would perform poorly on a GPU, since their irregular control flow (i.e., control flow divergence within the same warp) prevents good SIMD utilization.


Schedule

Milestone (April 14):

