Architecture

Overview

libvroom implements a speculative multi-threaded two-pass algorithm for high-performance CSV parsing. The design is based on research from Chang et al. (SIGMOD 2019) combined with SIMD techniques from Langdale & Lemire’s simdjson.

The parser achieves high throughput by:

  1. Using SIMD instructions to process 64 bytes at a time
  2. Employing a speculative multi-threaded approach for parallel chunk processing
  3. Using branchless state machines to minimize branch mispredictions
  4. Leveraging carry-less multiplication (PCLMULQDQ/PMULL) for efficient quote parity tracking

Two-Pass Algorithm

The core parsing algorithm operates in two passes, inspired by Chang et al.’s speculative distributed CSV parsing approach.

flowchart LR
    A[Input File] --> B[First Pass:<br/>Find Safe Splits]
    B --> C[Second Pass:<br/>Build Index]
    C --> D[Field Positions]

The first pass divides the file into chunks and finds safe split points based on quote parity. The second pass then processes chunks in parallel using a SIMD-accelerated state machine.

First Pass: Quote Parity and Safe Division Points

The first pass scans the file to find safe chunk boundaries for parallel processing. The key challenge is handling quoted fields that may contain newline characters—a newline inside a quoted field is data, not a record boundary.

File: "name","description"\n"Alice","Hello\nWorld"\n"Bob","Hi"
                          ↑                       ↑
                     Safe boundary          Safe boundary
                   (even quote count)     (even quote count)

The algorithm tracks quote parity—whether we’re inside or outside a quoted field at any given position:

  • Even quote count: We’re outside quotes; newlines are record boundaries
  • Odd quote count: We’re inside quotes; newlines are part of field data

For each chunk, the first pass finds:

Output Description Use
first_even_nl First newline at even quote count Safe split point if preceding chunk has even quotes
first_odd_nl First newline at odd quote count Safe split point if preceding chunk has odd quotes
n_quotes Total quote count in chunk Determines which split point to use for next chunk

The cumulative quote count from all preceding chunks determines which split point (first_even_nl or first_odd_nl) to use at each chunk boundary.

Speculative Parsing

For scenarios where determining exact quote state is expensive, libvroom supports speculative parsing (Chang et al., SIGMOD 2019):

  1. Divide the file into N chunks (one per thread)
  2. Speculate about quote state at each boundary by scanning backwards
  3. Parse chunks in parallel based on speculation
  4. Verify and fall back to single-threaded if speculation fails

The backward scanning heuristic (get_quotation_state) examines up to 64KB before each chunk boundary, looking for patterns that definitively indicate quote state:

  • A quote followed by non-delimiter/quote/newline → inside quoted field
  • A non-delimiter/quote/newline followed by quote → outside quoted field

This approach achieves >98% success rate on real-world datasets (Chang et al.), with graceful fallback when speculation fails.

Second Pass: SIMD-Accelerated Field Indexing

The second pass builds an index of all field boundaries using a SIMD-accelerated state machine. This is where the bulk of parsing work occurs.

Algorithm (per 64-byte block):

  1. Load 64 bytes into SIMD registers
  2. Create bitmasks for quotes, delimiters, and newlines using parallel comparison
  3. Compute quote mask using carry-less multiplication (identifies positions inside quotes)
  4. Mask out delimiters/newlines that are inside quotes
  5. Extract and store field boundary positions
// Pseudocode for second pass core loop
for each 64-byte block:
    quotes = SIMD_compare(block, '"')        // Find all quotes
    delims = SIMD_compare(block, ',')        // Find all delimiters
    newlines = SIMD_compare(block, '\n')     // Find all newlines

    inside_quote = CLMul(quotes, ~0ULL)      // Compute quote parity mask
    inside_quote ^= prev_iter_inside_quote   // Chain across blocks

    field_seps = (delims | newlines) & ~inside_quote  // Valid separators

    extract_and_store(field_seps, output_index)

State Machine

The parser uses a five-state finite state machine compliant with RFC 4180:

States: RECORD_START → FIELD_START → UNQUOTED_FIELD
                           ↓
                      QUOTED_FIELD → QUOTED_END

State transitions are determined by the character class: delimiter (,), quote ("), newline (\n), or other characters.

State Descriptions

State Description Valid Transitions
RECORD_START At the beginning of a new record (row) → FIELD_START, QUOTED_FIELD, UNQUOTED_FIELD, or stay
FIELD_START At the beginning of a new field (after comma) → FIELD_START (empty), QUOTED_FIELD, UNQUOTED_FIELD, RECORD_START
UNQUOTED_FIELD Inside an unquoted field → FIELD_START, RECORD_START, or stay
QUOTED_FIELD Inside a quoted field (all chars are data) → QUOTED_END on quote, otherwise stay
QUOTED_END Just saw quote inside quoted field → QUOTED_FIELD (escaped quote), FIELD_START, RECORD_START, or ERROR

Branchless Implementation

For maximum performance, libvroom provides a branchless state machine implementation (branchless_state_machine.h) that uses lookup tables instead of switch statements:

// Character classification table: 256 bytes, O(1) lookup
uint8_t char_class[256];  // Maps char → {DELIMITER, QUOTE, NEWLINE, OTHER}

// Transition table: 5 states × 4 char classes = 20 bytes
PackedResult transitions[20];  // Maps (state, class) → (new_state, is_separator, error)

// Processing: two table lookups, no branches
CharClass cls = char_class[byte];
PackedResult result = transitions[state * 4 + cls];

This eliminates 90%+ of branches in the parsing hot path, significantly improving instruction-level parallelism.

SIMD Implementation

libvroom uses Google Highway (v1.3.0) for portable SIMD operations across different architectures:

Architecture SIMD Support Vector Width
x86-64 SSE4.2, AVX2, AVX-512 128-512 bits
ARM NEON, SVE/SVE2 128-2048 bits
Fallback Scalar operations 64 bits

Key SIMD Operations

Function Description SIMD Instruction
fill_input Load 64 bytes into SIMD registers _mm256_loadu_si256 / vld1q_u8
cmp_mask_against_input Compare all bytes against a character, producing a 64-bit mask _mm256_cmpeq_epi8_mm256_movemask_epi8
find_quote_mask Compute quote parity mask via prefix XOR _mm_clmulepi64_si128 / vmull_p64

Carry-Less Multiplication for Quote Parity

The find_quote_mask function uses carry-less multiplication (CLMUL) to compute quote parity in constant time. This replaces what would otherwise be an O(64) serial XOR loop.

// Mathematical insight: XOR prefix sum can be computed via CLMUL
// quote_bits × 0xFFFFFFFFFFFFFFFF computes the running XOR
//
// Example: quote_bits = 0b00100100 (quotes at positions 2 and 5)
// Result:  0b00111100 (inside quote from position 2 to 5)

auto result = CLMulLower(quote_bits, ~0ULL);
quote_mask = result ^ prev_iter_inside_quote;  // Chain across blocks

This technique, pioneered by simdjson, provides approximately 28× speedup over scalar implementation for quote tracking.

Handling Quoted Fields at Chunk Boundaries

When parsing in parallel, quoted fields may span chunk boundaries. libvroom handles this through:

  1. First pass: Each chunk reports both first_even_nl and first_odd_nl
  2. Cumulative tracking: Running quote count determines which boundary to use
  3. Safe division: Chunks start only at positions where quote state is known
Chunk 1: ...data,"quoted     Chunk 2: field",normal...
                  ↑                   ↑
            (odd quotes)         (even quotes)
            Inside quoted field! Safe to split here
            Use first_odd_nl

If a chunk’s first_even_nl or first_odd_nl is invalid (null_pos), the parser falls back to single-threaded mode for correctness.

Key Components

File Purpose
include/libvroom.h Main public API - Parser class, unified interface
include/two_pass.h Core two-pass parsing algorithm with multi-threading
include/streaming.h Streaming parser for memory-efficient large file processing
include/simd_highway.h Portable SIMD operations using Google Highway
include/branchless_state_machine.h Lookup table-based state machine (branch-free)
include/error.h Error codes, severity levels, ErrorCollector class
include/dialect.h CSV dialect detection (delimiter, quote char)
include/encoding.h Encoding detection and transcoding (UTF-16, UTF-32)
include/io_util.h File loading with SIMD-aligned padding (32+ bytes)
include/mem_util.h Aligned memory allocation for SIMD
include/libvroom_c.h C API wrapper for FFI bindings

Memory Layout

The parser produces an index structure with field boundary positions:

class index {
    uint64_t columns;      // Number of columns detected
    uint8_t n_threads;     // Thread count used (interleave stride)
    uint64_t* n_indexes;   // Index count per thread [n_threads]
    uint64_t* indexes;     // Interleaved field positions
};

Indexes are stored interleaved by thread for cache efficiency during parallel construction:

Thread 0: pos[0], pos[4], pos[8], ...
Thread 1: pos[1], pos[5], pos[9], ...
Thread 2: pos[2], pos[6], pos[10], ...
Thread 3: pos[3], pos[7], pos[11], ...

This interleaving ensures each thread writes to independent cache lines, avoiding false sharing.

Thread Safety

  • First pass: Each thread operates on independent chunk ranges (embarrassingly parallel)
  • Second pass: Threads write to interleaved positions—no overlap, no locks needed
  • Error collection: Thread-local ErrorCollector instances are merged after parsing

The design achieves lock-free parallelism by ensuring threads never write to the same memory locations.

Performance Considerations

Optimization Implementation Impact
SIMD Processing 64-byte blocks via Highway Process 64 chars per iteration
Branchless Parsing Lookup table state machine Eliminate 90%+ branches
CLMUL Quote Tracking Carry-less multiplication 28× faster than scalar XOR
Prefetching __builtin_prefetch hints Hide memory latency
Memory Alignment 64-byte aligned buffers Optimal SIMD loads
Interleaved Output Per-thread index positions No false sharing
Speculative Parsing Backward quote state scan Enable parallelism

Chunk Sizing

  • Minimum: 64 bytes (ensures at least one SIMD block)
  • Recommended: 64KB–1MB for optimal parallelism/overhead balance
  • Fallback: Files smaller than 64 bytes per thread use single-threaded mode

References

The libvroom architecture draws from these foundational papers:

  1. Chang, G., Li, Y., Eilebrecht, E., Chandramouli, B., & Kossmann, D. (2019). Speculative Distributed CSV Data Parsing for Big Data Analytics. SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data, pp. 883-899. https://doi.org/10.1145/3299869.3319898

    Provides the theoretical foundation for speculative multi-threaded CSV parsing with quote-state tracking at chunk boundaries.

  2. Langdale, G. & Lemire, D. (2019). Parsing Gigabytes of JSON per Second. The VLDB Journal, 28(6), pp. 941-960. https://doi.org/10.1007/s00778-019-00578-5

    Introduces the two-stage SIMD parsing architecture, carry-less multiplication for quote parity, and branchless design patterns used throughout libvroom.

  3. van den Burgh, G.J.J. & Nazabal, A. (2019). Wrangling Messy CSV Files by Detecting Row and Type Patterns. Data Mining and Knowledge Discovery, 33(4), pp. 1415-1441. https://doi.org/10.1007/s10618-019-00646-y

    Provides the basis for libvroom’s dialect detection algorithm (CleverCSV approach).