libvroom

High-performance CSV parser using SIMD instructions

Overview

libvroom is a high-performance CSV parser library using portable SIMD instructions via Google Highway, designed for future integration with R’s vroom package.

The parser achieves 4+ GB/s throughput on modern hardware using a speculative multi-threaded two-pass algorithm based on research by Chang et al. (SIGMOD 2019) and SIMD techniques from Langdale & Lemire (simdjson).

Key Features

Parsing

  • SIMD-accelerated parsing using Google Highway
  • Multi-threaded speculative parsing for large files
  • Streaming parser for memory-constrained environments
  • Index caching for instant re-reads of parsed files

Detection & Handling

  • Automatic dialect detection (delimiter, quotes, line endings)
  • Encoding detection with UTF-16/UTF-32 transcoding
  • Schema inference with type detection
  • Three error modes: strict, permissive, best-effort
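To make the detection features above more concrete, here is a minimal sketch of delimiter detection and column type inference in plain Python. It is not libvroom's implementation: the candidate list, the scoring rule, and the function names `detect_delimiter` and `infer_type` are assumptions chosen for illustration, and the parsing relies on Python's standard `csv` module.

```python
import csv
import io
from collections import Counter

# Hypothetical candidate set; a real detector may consider more characters.
CANDIDATES = [",", "\t", ";", "|"]

def detect_delimiter(sample: str, candidates=CANDIDATES) -> str:
    """Pick the candidate that splits the sample into the most consistent rows."""
    best, best_score = candidates[0], (0.0, 0)
    for delim in candidates:
        rows = csv.reader(io.StringIO(sample), delimiter=delim)
        widths = [len(row) for row in rows if row]
        if not widths:
            continue
        width, hits = Counter(widths).most_common(1)[0]
        if width < 2:
            continue  # this candidate never actually split a row
        # Prefer the highest share of rows with the same width, then more columns.
        score = (hits / len(widths), width)
        if score > best_score:
            best, best_score = delim, score
    return best

def infer_type(values) -> str:
    """Return the narrowest of 'int', 'float', 'string' that fits all non-empty values."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"

    def fits(cast) -> bool:
        # Python's int()/float() also accept forms such as "1_000" or "nan";
        # a production parser would apply stricter rules.
        try:
            for v in present:
                cast(v)
        except ValueError:
            return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "float"
    return "string"

sample = "a;b;c\n1;2;3\n4,5;5;6\n"
print(detect_delimiter(sample))
print(infer_type(["1", "2.5", ""]))
```

Scoring by row-width consistency means a stray comma inside a semicolon-separated field does not mislead the detector, because splitting on commas gives rows of uneven width.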

Quick Start

Python

pip install vroom-csv

import vroom_csv

table = vroom_csv.read_csv("data.csv")
print(f"Loaded {table.num_rows} rows, {table.num_columns} columns")

# Zero-copy export to PyArrow, Polars, DuckDB
import polars as pl
df = pl.from_arrow(table)

See the Python Package documentation for complete usage.

C++ / CLI

# Clone and build
git clone https://github.com/jimhester/libvroom.git
cd libvroom
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Try the CLI
./build/vroom info your_data.csv
./build/vroom head -n 5 your_data.csv
./build/vroom schema your_data.csv

See the Getting Started guide for detailed build instructions.

CLI Tool

The vroom command-line tool provides fast CSV processing:

Command   Description
count     Count rows in a CSV file
head      Display first N rows
tail      Display last N rows
sample    Random sample of N rows
select    Select columns by name or index
pretty    Pretty-print with aligned columns
info      Display file metadata
dialect   Detect CSV dialect
schema    Infer column types
stats     Column statistics (min, max, mean)

See the CLI Reference for complete documentation.

C++ Library

#include <libvroom.h>

#include <string>

int main() {
    // Load the whole file into memory
    libvroom::FileBuffer buffer = libvroom::load_file("data.csv");
    libvroom::Parser parser(4);  // 4 threads

    auto result = parser.parse(buffer.data(), buffer.size());

    // Iterate over rows with type-safe access
    for (auto row : result.rows()) {
        auto name = row.get<std::string>("name");
        auto value = row.get<double>("value");
    }
    return 0;
}

Performance

Single-threaded throughput on Apple Silicon (M3 Max):

File Size   Throughput   vs zsv
10 MB       3.1 GB/s     4.0x faster
100 MB      4.4 GB/s     4.6x faster
200 MB      4.7 GB/s     4.9x faster

Multi-threaded parsing reaches 6+ GB/s on large files. See Benchmarks for detailed comparisons.
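As a rough reading aid, the single-threaded figures above can be converted into wall-clock parse times and an implied baseline throughput. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and that "Nx faster" means a throughput ratio. Neither assumption is stated in the table, and the helper names are hypothetical.

```python
# (file size in MB, libvroom throughput in GB/s, reported speedup over zsv)
ROWS = [(10, 3.1, 4.0), (100, 4.4, 4.6), (200, 4.7, 4.9)]

def parse_time_ms(size_mb: float, gb_per_s: float) -> float:
    """Time to parse size_mb megabytes at gb_per_s, in milliseconds."""
    # Assumption: decimal units, 1 MB = 1e6 bytes and 1 GB = 1e9 bytes.
    return size_mb * 1e6 / (gb_per_s * 1e9) * 1e3

def implied_baseline(gb_per_s: float, speedup: float) -> float:
    """Baseline throughput in GB/s, assuming speedup is a throughput ratio."""
    return gb_per_s / speedup

for size_mb, tput, speedup in ROWS:
    ms = parse_time_ms(size_mb, tput)
    base = implied_baseline(tput, speedup)
    print(f"{size_mb:>4} MB: {ms:6.1f} ms, implied zsv throughput ~{base:.2f} GB/s")
```

Under these assumptions, the 100 MB file takes roughly 23 ms to parse on a single thread.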

Architecture

The parser implements a two-pass algorithm optimized for parallel processing:

  1. First Pass: Scans for line boundaries while tracking quote parity to find safe chunk split points
  2. Second Pass: SIMD-based field indexing using a state machine, processing 64 bytes per iteration
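The first pass can be illustrated with a short sequential sketch in Python. It tracks quote parity and records only newlines that fall outside quoted fields, then picks one such offset near each ideal chunk boundary. This is a simplification under stated assumptions: the function name `find_split_points` is hypothetical, the scan is single-threaded, and the speculative variant described by Chang et al. instead lets each thread guess its starting quote state.

```python
from bisect import bisect_left

def find_split_points(data: bytes, n_chunks: int, quote: int = ord('"')) -> list[int]:
    """Return offsets just after newlines outside quotes, one near each chunk boundary."""
    safe = []          # offsets where a new record can start
    in_quotes = False  # quote parity: True after an odd number of quote bytes
    for i, b in enumerate(data):
        if b == quote:
            # An escaped quote ("") toggles twice, so parity is unchanged overall.
            in_quotes = not in_quotes
        elif b == 0x0A and not in_quotes:
            safe.append(i + 1)

    splits = []
    for k in range(1, n_chunks):
        ideal = k * len(data) / n_chunks
        # First safe offset at or after the ideal boundary.
        j = bisect_left(safe, ideal)
        if j == len(safe):
            break
        offset = safe[j]
        if offset < len(data) and (not splits or offset > splits[-1]):
            splits.append(offset)
    return splits

data = b'a,b\n"x\ny",2\n3,4\n'
print(find_split_points(data, 2))
```

The newline inside the quoted field `"x\ny"` is never chosen as a split point, which is what makes it safe to hand the resulting chunks to independent workers.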

Learn more in the Architecture documentation.
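For a feel of how 64-byte field indexing can work, the sketch below emulates the bitmask technique popularized by simdjson, using Python integers as 64-bit masks. Whether libvroom's state machine is organized this way is an assumption. The function names are hypothetical, and the per-byte loop in `byte_mask` stands in for a SIMD compare followed by a movemask.

```python
BLOCK = 64
MASK64 = (1 << BLOCK) - 1

def byte_mask(block: bytes, ch: int) -> int:
    """Bit i is set when block[i] == ch."""
    m = 0
    for i, b in enumerate(block):
        if b == ch:
            m |= 1 << i
    return m

def prefix_xor(m: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of m."""
    for shift in (1, 2, 4, 8, 16, 32):
        m ^= (m << shift) & MASK64
    return m

def index_fields(data: bytes, delim: int = ord(","), quote: int = ord('"')) -> list[int]:
    """Offsets of every delimiter and newline that lies outside quotes."""
    out = []
    carry = 0  # all ones when the previous block ended inside a quoted field
    for base in range(0, len(data), BLOCK):
        block = data[base:base + BLOCK]
        # Bits set from each opening quote up to (not including) its closing quote.
        inside = prefix_xor(byte_mask(block, quote)) ^ carry
        seps = (byte_mask(block, delim) | byte_mask(block, 0x0A)) & ~inside & MASK64
        while seps:
            low = seps & -seps                       # isolate the lowest set bit
            out.append(base + low.bit_length() - 1)  # convert bit to byte offset
            seps ^= low
        carry = MASK64 if (inside >> (BLOCK - 1)) & 1 else 0
    return out

print(index_fields(b'a,"b,c",d\n'))
```

The carry mask propagates the quote state across block boundaries, so a quoted field that straddles two 64-byte blocks is still handled correctly.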

Documentation

Topic               Description
Getting Started     Build instructions and basic usage
CLI Reference       Command-line tool documentation
Streaming Parser    Memory-efficient parsing for large files
Index Caching       Speed up repeated file reads
C API Reference     C bindings for FFI integration
Python Package      Python bindings with Arrow support
Architecture        Two-pass algorithm details
Error Handling      Error modes and recovery
Integration Guide   CMake integration and build options
Benchmarks          Performance comparisons

References