flowchart LR
A[Input File] --> B[First Pass:<br/>Find Safe Splits]
B --> C[Second Pass:<br/>Build Index]
C --> D[Field Positions]
Architecture
Overview
libvroom implements a speculative multi-threaded two-pass algorithm for high-performance CSV parsing. The design is based on research from Chang et al. (SIGMOD 2019) combined with SIMD techniques from Langdale & Lemire’s simdjson.
The parser achieves high throughput by:
- Using SIMD instructions to process 64 bytes at a time
- Employing a speculative multi-threaded approach for parallel chunk processing
- Using branchless state machines to minimize branch mispredictions
- Leveraging carry-less multiplication (PCLMULQDQ/PMULL) for efficient quote parity tracking
Two-Pass Algorithm
The core parsing algorithm operates in two passes, inspired by Chang et al.’s speculative distributed CSV parsing approach.
The first pass divides the file into chunks and finds safe split points based on quote parity. The second pass then processes chunks in parallel using a SIMD-accelerated state machine.
First Pass: Quote Parity and Safe Division Points
The first pass scans the file to find safe chunk boundaries for parallel processing. The key challenge is handling quoted fields that may contain newline characters—a newline inside a quoted field is data, not a record boundary.
File: "name","description"\n"Alice","Hello\nWorld"\n"Bob","Hi"
↑ ↑
Safe boundary Safe boundary
(even quote count) (even quote count)
The algorithm tracks quote parity—whether we’re inside or outside a quoted field at any given position:
- Even quote count: We’re outside quotes; newlines are record boundaries
- Odd quote count: We’re inside quotes; newlines are part of field data
For each chunk, the first pass finds:
| Output | Description | Use |
|---|---|---|
first_even_nl |
First newline at even quote count | Safe split point if preceding chunk has even quotes |
first_odd_nl |
First newline at odd quote count | Safe split point if preceding chunk has odd quotes |
n_quotes |
Total quote count in chunk | Determines which split point to use for next chunk |
The cumulative quote count from all preceding chunks determines which split point (first_even_nl or first_odd_nl) to use at each chunk boundary.
Speculative Parsing
For scenarios where determining exact quote state is expensive, libvroom supports speculative parsing (Chang et al., SIGMOD 2019):
- Divide the file into N chunks (one per thread)
- Speculate about quote state at each boundary by scanning backwards
- Parse chunks in parallel based on speculation
- Verify and fall back to single-threaded if speculation fails
The backward scanning heuristic (get_quotation_state) examines up to 64KB before each chunk boundary, looking for patterns that definitively indicate quote state:
- A quote followed by non-delimiter/quote/newline → inside quoted field
- A non-delimiter/quote/newline followed by quote → outside quoted field
This approach achieves >98% success rate on real-world datasets (Chang et al.), with graceful fallback when speculation fails.
Second Pass: SIMD-Accelerated Field Indexing
The second pass builds an index of all field boundaries using a SIMD-accelerated state machine. This is where the bulk of parsing work occurs.
Algorithm (per 64-byte block):
- Load 64 bytes into SIMD registers
- Create bitmasks for quotes, delimiters, and newlines using parallel comparison
- Compute quote mask using carry-less multiplication (identifies positions inside quotes)
- Mask out delimiters/newlines that are inside quotes
- Extract and store field boundary positions
// Pseudocode for second pass core loop
for each 64-byte block:
quotes = SIMD_compare(block, '"') // Find all quotes
delims = SIMD_compare(block, ',') // Find all delimiters
newlines = SIMD_compare(block, '\n') // Find all newlines
inside_quote = CLMul(quotes, ~0ULL) // Compute quote parity mask
inside_quote ^= prev_iter_inside_quote // Chain across blocks
field_seps = (delims | newlines) & ~inside_quote // Valid separators
extract_and_store(field_seps, output_index)State Machine
The parser uses a five-state finite state machine compliant with RFC 4180:
States: RECORD_START → FIELD_START → UNQUOTED_FIELD
↓
QUOTED_FIELD → QUOTED_END
State transitions are determined by the character class: delimiter (,), quote ("), newline (\n), or other characters.
State Descriptions
| State | Description | Valid Transitions |
|---|---|---|
| RECORD_START | At the beginning of a new record (row) | → FIELD_START, QUOTED_FIELD, UNQUOTED_FIELD, or stay |
| FIELD_START | At the beginning of a new field (after comma) | → FIELD_START (empty), QUOTED_FIELD, UNQUOTED_FIELD, RECORD_START |
| UNQUOTED_FIELD | Inside an unquoted field | → FIELD_START, RECORD_START, or stay |
| QUOTED_FIELD | Inside a quoted field (all chars are data) | → QUOTED_END on quote, otherwise stay |
| QUOTED_END | Just saw quote inside quoted field | → QUOTED_FIELD (escaped quote), FIELD_START, RECORD_START, or ERROR |
Branchless Implementation
For maximum performance, libvroom provides a branchless state machine implementation (branchless_state_machine.h) that uses lookup tables instead of switch statements:
// Character classification table: 256 bytes, O(1) lookup
uint8_t char_class[256]; // Maps char → {DELIMITER, QUOTE, NEWLINE, OTHER}
// Transition table: 5 states × 4 char classes = 20 bytes
PackedResult transitions[20]; // Maps (state, class) → (new_state, is_separator, error)
// Processing: two table lookups, no branches
CharClass cls = char_class[byte];
PackedResult result = transitions[state * 4 + cls];This eliminates 90%+ of branches in the parsing hot path, significantly improving instruction-level parallelism.
SIMD Implementation
libvroom uses Google Highway (v1.3.0) for portable SIMD operations across different architectures:
| Architecture | SIMD Support | Vector Width |
|---|---|---|
| x86-64 | SSE4.2, AVX2, AVX-512 | 128-512 bits |
| ARM | NEON, SVE/SVE2 | 128-2048 bits |
| Fallback | Scalar operations | 64 bits |
Key SIMD Operations
| Function | Description | SIMD Instruction |
|---|---|---|
fill_input |
Load 64 bytes into SIMD registers | _mm256_loadu_si256 / vld1q_u8 |
cmp_mask_against_input |
Compare all bytes against a character, producing a 64-bit mask | _mm256_cmpeq_epi8 → _mm256_movemask_epi8 |
find_quote_mask |
Compute quote parity mask via prefix XOR | _mm_clmulepi64_si128 / vmull_p64 |
Carry-Less Multiplication for Quote Parity
The find_quote_mask function uses carry-less multiplication (CLMUL) to compute quote parity in constant time. This replaces what would otherwise be an O(64) serial XOR loop.
// Mathematical insight: XOR prefix sum can be computed via CLMUL
// quote_bits × 0xFFFFFFFFFFFFFFFF computes the running XOR
//
// Example: quote_bits = 0b00100100 (quotes at positions 2 and 5)
// Result: 0b00111100 (inside quote from position 2 to 5)
auto result = CLMulLower(quote_bits, ~0ULL);
quote_mask = result ^ prev_iter_inside_quote; // Chain across blocksThis technique, pioneered by simdjson, provides approximately 28× speedup over scalar implementation for quote tracking.
Handling Quoted Fields at Chunk Boundaries
When parsing in parallel, quoted fields may span chunk boundaries. libvroom handles this through:
- First pass: Each chunk reports both
first_even_nlandfirst_odd_nl - Cumulative tracking: Running quote count determines which boundary to use
- Safe division: Chunks start only at positions where quote state is known
Chunk 1: ...data,"quoted Chunk 2: field",normal...
↑ ↑
(odd quotes) (even quotes)
Inside quoted field! Safe to split here
Use first_odd_nl
If a chunk’s first_even_nl or first_odd_nl is invalid (null_pos), the parser falls back to single-threaded mode for correctness.
Key Components
| File | Purpose |
|---|---|
include/libvroom.h |
Main public API - Parser class, unified interface |
include/two_pass.h |
Core two-pass parsing algorithm with multi-threading |
include/streaming.h |
Streaming parser for memory-efficient large file processing |
include/simd_highway.h |
Portable SIMD operations using Google Highway |
include/branchless_state_machine.h |
Lookup table-based state machine (branch-free) |
include/error.h |
Error codes, severity levels, ErrorCollector class |
include/dialect.h |
CSV dialect detection (delimiter, quote char) |
include/encoding.h |
Encoding detection and transcoding (UTF-16, UTF-32) |
include/io_util.h |
File loading with SIMD-aligned padding (32+ bytes) |
include/mem_util.h |
Aligned memory allocation for SIMD |
include/libvroom_c.h |
C API wrapper for FFI bindings |
Memory Layout
The parser produces an index structure with field boundary positions:
class index {
uint64_t columns; // Number of columns detected
uint8_t n_threads; // Thread count used (interleave stride)
uint64_t* n_indexes; // Index count per thread [n_threads]
uint64_t* indexes; // Interleaved field positions
};Indexes are stored interleaved by thread for cache efficiency during parallel construction:
Thread 0: pos[0], pos[4], pos[8], ...
Thread 1: pos[1], pos[5], pos[9], ...
Thread 2: pos[2], pos[6], pos[10], ...
Thread 3: pos[3], pos[7], pos[11], ...
This interleaving ensures each thread writes to independent cache lines, avoiding false sharing.
Thread Safety
- First pass: Each thread operates on independent chunk ranges (embarrassingly parallel)
- Second pass: Threads write to interleaved positions—no overlap, no locks needed
- Error collection: Thread-local ErrorCollector instances are merged after parsing
The design achieves lock-free parallelism by ensuring threads never write to the same memory locations.
Performance Considerations
| Optimization | Implementation | Impact |
|---|---|---|
| SIMD Processing | 64-byte blocks via Highway | Process 64 chars per iteration |
| Branchless Parsing | Lookup table state machine | Eliminate 90%+ branches |
| CLMUL Quote Tracking | Carry-less multiplication | 28× faster than scalar XOR |
| Prefetching | __builtin_prefetch hints |
Hide memory latency |
| Memory Alignment | 64-byte aligned buffers | Optimal SIMD loads |
| Interleaved Output | Per-thread index positions | No false sharing |
| Speculative Parsing | Backward quote state scan | Enable parallelism |
Chunk Sizing
- Minimum: 64 bytes (ensures at least one SIMD block)
- Recommended: 64KB–1MB for optimal parallelism/overhead balance
- Fallback: Files smaller than 64 bytes per thread use single-threaded mode
References
The libvroom architecture draws from these foundational papers:
Chang, G., Li, Y., Eilebrecht, E., Chandramouli, B., & Kossmann, D. (2019). Speculative Distributed CSV Data Parsing for Big Data Analytics. SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data, pp. 883-899. https://doi.org/10.1145/3299869.3319898
Provides the theoretical foundation for speculative multi-threaded CSV parsing with quote-state tracking at chunk boundaries.
Langdale, G. & Lemire, D. (2019). Parsing Gigabytes of JSON per Second. The VLDB Journal, 28(6), pp. 941-960. https://doi.org/10.1007/s00778-019-00578-5
Introduces the two-stage SIMD parsing architecture, carry-less multiplication for quote parity, and branchless design patterns used throughout libvroom.
van den Burgh, G.J.J. & Nazabal, A. (2019). Wrangling Messy CSV Files by Detecting Row and Type Patterns. Data Mining and Knowledge Discovery, 33(4), pp. 1415-1441. https://doi.org/10.1007/s10618-019-00646-y
Provides the basis for libvroom’s dialect detection algorithm (CleverCSV approach).