Architecture

Overview

libvroom implements a speculative multi-threaded two-pass algorithm for high-performance CSV parsing. The design is based on research from Chang et al. (SIGMOD 2019) combined with SIMD techniques from Langdale & Lemire’s simdjson.

The parser achieves high throughput by:

Using SIMD instructions to process 64 bytes at a time
Employing a speculative multi-threaded approach for parallel chunk processing
Using branchless state machines to minimize branch mispredictions
Leveraging carry-less multiplication (PCLMULQDQ/PMULL) for efficient quote parity tracking

Two-Pass Algorithm

The core parsing algorithm operates in two passes, inspired by Chang et al.’s speculative distributed CSV parsing approach.

flowchart LR
    A[Input File] --> B[First Pass:<br/>Find Safe Splits]
    B --> C[Second Pass:<br/>Build Index]
    C --> D[Field Positions]

The first pass divides the file into chunks and finds safe split points based on quote parity. The second pass then processes chunks in parallel using a SIMD-accelerated state machine.

First Pass: Quote Parity and Safe Division Points

The first pass scans the file to find safe chunk boundaries for parallel processing. The key challenge is handling quoted fields that may contain newline characters—a newline inside a quoted field is data, not a record boundary.

File: "name","description"\n"Alice","Hello\nWorld"\n"Bob","Hi"
                          ↑                       ↑
                     Safe boundary          Safe boundary
                   (even quote count)     (even quote count)

The algorithm tracks quote parity—whether we’re inside or outside a quoted field at any given position:

Even quote count: We’re outside quotes; newlines are record boundaries
Odd quote count: We’re inside quotes; newlines are part of field data

For each chunk, the first pass finds:

Output	Description	Use
`first_even_nl`	First newline at even quote count	Safe split point if preceding chunk has even quotes
`first_odd_nl`	First newline at odd quote count	Safe split point if preceding chunk has odd quotes
`n_quotes`	Total quote count in chunk	Determines which split point to use for next chunk

The cumulative quote count from all preceding chunks determines which split point (first_even_nl or first_odd_nl) to use at each chunk boundary.

Speculative Parsing

For scenarios where determining exact quote state is expensive, libvroom supports speculative parsing (Chang et al., SIGMOD 2019):

Divide the file into N chunks (one per thread)
Speculate about quote state at each boundary by scanning backwards
Parse chunks in parallel based on speculation
Verify and fall back to single-threaded if speculation fails

The backward scanning heuristic (get_quotation_state) examines up to 64KB before each chunk boundary, looking for patterns that definitively indicate quote state:

A quote followed by non-delimiter/quote/newline → inside quoted field
A non-delimiter/quote/newline followed by quote → outside quoted field

This approach achieves >98% success rate on real-world datasets (Chang et al.), with graceful fallback when speculation fails.

Second Pass: SIMD-Accelerated Field Indexing

The second pass builds an index of all field boundaries using a SIMD-accelerated state machine. This is where the bulk of parsing work occurs.

Algorithm (per 64-byte block):

Load 64 bytes into SIMD registers
Create bitmasks for quotes, delimiters, and newlines using parallel comparison
Compute quote mask using carry-less multiplication (identifies positions inside quotes)
Mask out delimiters/newlines that are inside quotes
Extract and store field boundary positions

// Pseudocode for second pass core loop
for each 64-byte block:
    quotes = SIMD_compare(block, '"')        // Find all quotes
    delims = SIMD_compare(block, ',')        // Find all delimiters
    newlines = SIMD_compare(block, '\n')     // Find all newlines

    inside_quote = CLMul(quotes, ~0ULL)      // Compute quote parity mask
    inside_quote ^= prev_iter_inside_quote   // Chain across blocks

    field_seps = (delims | newlines) & ~inside_quote  // Valid separators

    extract_and_store(field_seps, output_index)

State Machine

The parser uses a five-state finite state machine compliant with RFC 4180:

States: RECORD_START → FIELD_START → UNQUOTED_FIELD
                           ↓
                      QUOTED_FIELD → QUOTED_END

State transitions are determined by the character class: delimiter (,), quote ("), newline (\n), or other characters.

State Descriptions

State	Description	Valid Transitions
RECORD_START	At the beginning of a new record (row)	→ FIELD_START, QUOTED_FIELD, UNQUOTED_FIELD, or stay
FIELD_START	At the beginning of a new field (after comma)	→ FIELD_START (empty), QUOTED_FIELD, UNQUOTED_FIELD, RECORD_START
UNQUOTED_FIELD	Inside an unquoted field	→ FIELD_START, RECORD_START, or stay
QUOTED_FIELD	Inside a quoted field (all chars are data)	→ QUOTED_END on quote, otherwise stay
QUOTED_END	Just saw quote inside quoted field	→ QUOTED_FIELD (escaped quote), FIELD_START, RECORD_START, or ERROR

Branchless Implementation

For maximum performance, libvroom provides a branchless state machine implementation (branchless_state_machine.h) that uses lookup tables instead of switch statements:

// Character classification table: 256 bytes, O(1) lookup
uint8_t char_class[256];  // Maps char → {DELIMITER, QUOTE, NEWLINE, OTHER}

// Transition table: 5 states × 4 char classes = 20 bytes
PackedResult transitions[20];  // Maps (state, class) → (new_state, is_separator, error)

// Processing: two table lookups, no branches
CharClass cls = char_class[byte];
PackedResult result = transitions[state * 4 + cls];

This eliminates 90%+ of branches in the parsing hot path, significantly improving instruction-level parallelism.

SIMD Implementation

libvroom uses Google Highway (v1.3.0) for portable SIMD operations across different architectures:

Architecture	SIMD Support	Vector Width
x86-64	SSE4.2, AVX2, AVX-512	128-512 bits
ARM	NEON, SVE/SVE2	128-2048 bits
Fallback	Scalar operations	64 bits

Key SIMD Operations

Function	Description	SIMD Instruction
`fill_input`	Load 64 bytes into SIMD registers	`_mm256_loadu_si256` / `vld1q_u8`
`cmp_mask_against_input`	Compare all bytes against a character, producing a 64-bit mask	`_mm256_cmpeq_epi8` → `_mm256_movemask_epi8`
`find_quote_mask`	Compute quote parity mask via prefix XOR	`_mm_clmulepi64_si128` / `vmull_p64`

Carry-Less Multiplication for Quote Parity

The find_quote_mask function uses carry-less multiplication (CLMUL) to compute quote parity in constant time. This replaces what would otherwise be an O(64) serial XOR loop.

// Mathematical insight: XOR prefix sum can be computed via CLMUL
// quote_bits × 0xFFFFFFFFFFFFFFFF computes the running XOR
//
// Example: quote_bits = 0b00100100 (quotes at positions 2 and 5)
// Result:  0b00111100 (inside quote from position 2 to 5)

auto result = CLMulLower(quote_bits, ~0ULL);
quote_mask = result ^ prev_iter_inside_quote;  // Chain across blocks

This technique, pioneered by simdjson, provides approximately 28× speedup over scalar implementation for quote tracking.

Handling Quoted Fields at Chunk Boundaries

When parsing in parallel, quoted fields may span chunk boundaries. libvroom handles this through:

First pass: Each chunk reports both first_even_nl and first_odd_nl
Cumulative tracking: Running quote count determines which boundary to use
Safe division: Chunks start only at positions where quote state is known

Chunk 1: ...data,"quoted     Chunk 2: field",normal...
                  ↑                   ↑
            (odd quotes)         (even quotes)
            Inside quoted field! Safe to split here
            Use first_odd_nl

If a chunk’s first_even_nl or first_odd_nl is invalid (null_pos), the parser falls back to single-threaded mode for correctness.

Key Components

File	Purpose
`include/libvroom.h`	Main public API - Parser class, unified interface
`include/two_pass.h`	Core two-pass parsing algorithm with multi-threading
`include/streaming.h`	Streaming parser for memory-efficient large file processing
`include/simd_highway.h`	Portable SIMD operations using Google Highway
`include/branchless_state_machine.h`	Lookup table-based state machine (branch-free)
`include/error.h`	Error codes, severity levels, ErrorCollector class
`include/dialect.h`	CSV dialect detection (delimiter, quote char)
`include/encoding.h`	Encoding detection and transcoding (UTF-16, UTF-32)
`include/io_util.h`	File loading with SIMD-aligned padding (32+ bytes)
`include/mem_util.h`	Aligned memory allocation for SIMD
`include/libvroom_c.h`	C API wrapper for FFI bindings

Memory Layout

The parser produces an index structure with field boundary positions:

class index {
    uint64_t columns;      // Number of columns detected
    uint8_t n_threads;     // Thread count used (interleave stride)
    uint64_t* n_indexes;   // Index count per thread [n_threads]
    uint64_t* indexes;     // Interleaved field positions
};

Indexes are stored interleaved by thread for cache efficiency during parallel construction:

Thread 0: pos[0], pos[4], pos[8], ...
Thread 1: pos[1], pos[5], pos[9], ...
Thread 2: pos[2], pos[6], pos[10], ...
Thread 3: pos[3], pos[7], pos[11], ...

This interleaving ensures each thread writes to independent cache lines, avoiding false sharing.

Thread Safety

First pass: Each thread operates on independent chunk ranges (embarrassingly parallel)
Second pass: Threads write to interleaved positions—no overlap, no locks needed
Error collection: Thread-local ErrorCollector instances are merged after parsing

The design achieves lock-free parallelism by ensuring threads never write to the same memory locations.

Performance Considerations

Optimization	Implementation	Impact
SIMD Processing	64-byte blocks via Highway	Process 64 chars per iteration
Branchless Parsing	Lookup table state machine	Eliminate 90%+ branches
CLMUL Quote Tracking	Carry-less multiplication	28× faster than scalar XOR
Prefetching	`__builtin_prefetch` hints	Hide memory latency
Memory Alignment	64-byte aligned buffers	Optimal SIMD loads
Interleaved Output	Per-thread index positions	No false sharing
Speculative Parsing	Backward quote state scan	Enable parallelism

Chunk Sizing

Minimum: 64 bytes (ensures at least one SIMD block)
Recommended: 64KB–1MB for optimal parallelism/overhead balance
Fallback: Files smaller than 64 bytes per thread use single-threaded mode

References

The libvroom architecture draws from these foundational papers:

Chang, G., Li, Y., Eilebrecht, E., Chandramouli, B., & Kossmann, D. (2019). Speculative Distributed CSV Data Parsing for Big Data Analytics. SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data, pp. 883-899. https://doi.org/10.1145/3299869.3319898

Provides the theoretical foundation for speculative multi-threaded CSV parsing with quote-state tracking at chunk boundaries.
Langdale, G. & Lemire, D. (2019). Parsing Gigabytes of JSON per Second. The VLDB Journal, 28(6), pp. 941-960. https://doi.org/10.1007/s00778-019-00578-5

Introduces the two-stage SIMD parsing architecture, carry-less multiplication for quote parity, and branchless design patterns used throughout libvroom.
van den Burgh, G.J.J. & Nazabal, A. (2019). Wrangling Messy CSV Files by Detecting Row and Type Patterns. Data Mining and Knowledge Discovery, 33(4), pp. 1415-1441. https://doi.org/10.1007/s10618-019-00646-y

Provides the basis for libvroom’s dialect detection algorithm (CleverCSV approach).