libvroom

High-performance CSV parser using SIMD instructions

Overview

libvroom is a high-performance CSV parser library using portable SIMD instructions via Google Highway, designed for future integration with R’s vroom package.

The parser achieves 4+ GB/s throughput on modern hardware using a speculative multi-threaded two-pass algorithm based on research by Ge et al. (SIGMOD 2019) and SIMD techniques from Langdale & Lemire (simdjson).

Key Features

Parsing

SIMD-accelerated parsing using Google Highway
Multi-threaded speculative parsing for large files
Streaming parser for memory-constrained environments
Index caching for instant re-reads of parsed files
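
As a rough illustration of the index caching idea, the sketch below stores a line-offset index in a sidecar file and reuses it while the source file's size and modification time are unchanged. This is not libvroom's cache format: `line_offsets`, `load_index`, the JSON layout, and the size/mtime key are all assumptions made for the example, and quoted newlines are ignored for brevity.

```python
import json
import os


def line_offsets(path):
    """Byte offset at which each line starts (stand-in for a real field index)."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def load_index(path, cache_path):
    """Return (offsets, from_cache); rebuild when size or mtime has changed."""
    st = os.stat(path)
    key = [st.st_size, st.st_mtime_ns]  # hypothetical cache key
    try:
        with open(cache_path) as f:
            cached = json.load(f)
        if cached["key"] == key:
            return cached["offsets"], True
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable cache: fall through and rebuild
    offsets = line_offsets(path)
    with open(cache_path, "w") as f:
        json.dump({"key": key, "offsets": offsets}, f)
    return offsets, False
```

The second call on an unchanged file skips the scan entirely, which is where the "instant re-read" comes from; any write to the source file changes the key and forces a rebuild.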

Detection & Handling

Automatic dialect detection (delimiter, quotes, line endings)
Encoding detection with UTF-16/UTF-32 transcoding
Schema inference with type detection
Three error modes: strict, permissive, best-effort
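
One common heuristic for delimiter detection is to pick the candidate whose count per line is most consistent once quoted text is ignored. The sketch below shows that heuristic only; it is not libvroom's detection algorithm, and `count_outside_quotes`, `detect_delimiter`, and the candidate set are hypothetical. It also treats each line independently, so quoted fields that span lines are not handled.

```python
from collections import Counter


def count_outside_quotes(line, delimiter, quote='"'):
    """Count delimiter occurrences that are not inside a quoted field."""
    count, in_quotes = 0, False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            count += 1
    return count


def detect_delimiter(sample, candidates=",;\t|"):
    """Return the candidate with the most consistent per-line count, or None."""
    lines = [ln for ln in sample.splitlines() if ln]
    best, best_score = None, (0.0, 0)
    for cand in candidates:
        counts = [count_outside_quotes(ln, cand) for ln in lines]
        if not counts:
            continue
        modal, freq = Counter(counts).most_common(1)[0]
        if modal == 0:
            continue  # the most common case is "absent": not a delimiter
        score = (freq / len(counts), modal)  # consistency first, then count
        if score > best_score:
            best, best_score = cand, score
    return best


sample = 'a;b;c\n1,5;2;3\n"x;y";4;5\n'
print(repr(detect_delimiter(sample)))  # prints ';'
```

The comma in `1,5` and the semicolon inside `"x;y"` do not mislead the heuristic, because only the semicolon appears the same number of times on every line outside quotes.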

Quick Start

Python

pip install vroom-csv

import vroom_csv

table = vroom_csv.read_csv("data.csv")
print(f"Loaded {table.num_rows} rows, {table.num_columns} columns")

# Zero-copy export to PyArrow, Polars, DuckDB
import polars as pl
df = pl.from_arrow(table)

See the Python Package documentation for complete usage.

C++ / CLI

# Clone and build
git clone https://github.com/jimhester/libvroom.git
cd libvroom
cmake -B build -DCMAKE_BUILD_TYPE=Release
# nproc comes from GNU coreutils; on macOS, substitute $(sysctl -n hw.ncpu)
cmake --build build -j$(nproc)

# Try the CLI
./build/vroom info your_data.csv
./build/vroom head -n 5 your_data.csv
./build/vroom schema your_data.csv

See the Getting Started guide for detailed build instructions.

CLI Tool

The vroom command-line tool provides fast CSV processing:

Command	Description
`count`	Count rows in a CSV file
`head`	Display first N rows
`tail`	Display last N rows
`sample`	Random sample of N rows
`select`	Select columns by name or index
`pretty`	Pretty-print with aligned columns
`info`	Display file metadata
`dialect`	Detect CSV dialect
`schema`	Infer column types
`stats`	Column statistics (min, max, mean)

See the CLI Reference for complete documentation.

C++ Library

#include <libvroom.h>

#include <string>

int main() {
    // Load the file and create a parser that uses 4 threads
    libvroom::FileBuffer buffer = libvroom::load_file("data.csv");
    libvroom::Parser parser(4);

    auto result = parser.parse(buffer.data(), buffer.size());

    // Iterate over rows with type-safe access
    for (auto row : result.rows()) {
        auto name = row.get<std::string>("name");
        auto value = row.get<double>("value");
    }
    return 0;
}

Performance

Single-threaded throughput on Apple Silicon (M3 Max):

File Size	Throughput	Speedup vs. zsv
10 MB	3.1 GB/s	4.0x faster
100 MB	4.4 GB/s	4.6x faster
200 MB	4.7 GB/s	4.9x faster

Multi-threaded parsing reaches 6+ GB/s on large files. See Benchmarks for detailed comparisons.
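
To put the single-threaded figures in the table above in concrete terms, the sketch below converts each row into a parse time and into the zsv throughput the speedup implies. It assumes 1 GB = 1000 MB and that "Nx faster" is a ratio of throughputs; the derived numbers are arithmetic on the table, not separate measurements, and the function names are made up for the example.

```python
# (file size in MB, libvroom throughput in GB/s, reported speedup over zsv)
ROWS = [(10, 3.1, 4.0), (100, 4.4, 4.6), (200, 4.7, 4.9)]


def parse_time_ms(size_mb, gb_per_s):
    """Milliseconds needed to parse size_mb at gb_per_s (1 GB = 1000 MB)."""
    return size_mb / (gb_per_s * 1000) * 1000


def implied_baseline(gb_per_s, speedup):
    """Baseline throughput in GB/s if speedup is a throughput ratio."""
    return gb_per_s / speedup


for size_mb, gbps, speedup in ROWS:
    print(f"{size_mb:>4} MB: {parse_time_ms(size_mb, gbps):5.1f} ms, "
          f"implied zsv throughput {implied_baseline(gbps, speedup):.2f} GB/s")
```

At these rates even the 200 MB file parses in well under a tenth of a second, so file I/O rather than parsing can easily dominate end-to-end time.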

Architecture

The parser implements a two-pass algorithm optimized for parallel processing:

First Pass: Scans for line boundaries while tracking quote parity to find safe chunk split points
Second Pass: SIMD-based field indexing using a state machine, processing 64 bytes per iteration
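
The two passes can be sketched in plain Python. This is an illustration, not libvroom's implementation: `byte_mask`, `prefix_xor`, `safe_split_points`, and `index_fields` are hypothetical names, integer bitmasks stand in for SIMD registers, and the first pass is shown sequentially rather than speculatively. Quote handling uses the prefix-XOR idea described by Langdale & Lemire: XOR-ing the quote bits from the start of a block marks every byte that lies inside a quoted field, and a single carry bit is the only state passed between blocks.

```python
QUOTE, NEWLINE = ord('"'), ord("\n")


def byte_mask(block: bytes, target: int) -> int:
    """Bit i is set when block[i] == target (stand-in for a SIMD compare)."""
    mask = 0
    for i, b in enumerate(block):
        if b == target:
            mask |= 1 << i
    return mask


def prefix_xor(mask: int, width: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of mask."""
    shift = 1
    while shift < width:
        mask ^= mask << shift
        shift <<= 1
    return mask & ((1 << width) - 1)


def safe_split_points(data: bytes, n_chunks: int) -> list[int]:
    """First pass: offsets just past an unquoted newline, one per ideal boundary."""
    targets = [len(data) * k // n_chunks for k in range(1, n_chunks)]
    splits, in_quotes = [], False
    for pos, b in enumerate(data):
        if b == QUOTE:
            in_quotes = not in_quotes  # "" toggles twice, so parity is preserved
        elif b == NEWLINE and not in_quotes:
            if targets and targets[0] <= pos + 1 < len(data):
                splits.append(pos + 1)
                while targets and targets[0] <= pos + 1:
                    targets.pop(0)
    return splits


def index_fields(data: bytes, delimiter: int = ord(","), block_size: int = 64) -> list[int]:
    """Second pass: offsets of delimiters and newlines that lie outside quotes."""
    full = (1 << block_size) - 1
    positions, carry = [], 0  # carry is 1 when the previous block ended inside quotes
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        seps = byte_mask(block, delimiter) | byte_mask(block, NEWLINE)
        inside = prefix_xor(byte_mask(block, QUOTE), block_size)
        if carry:
            inside ^= full
        ends = seps & ~inside
        while ends:  # extract set bits from lowest to highest
            low = ends & -ends
            positions.append(start + low.bit_length() - 1)
            ends ^= low
        carry = (inside >> (block_size - 1)) & 1
    return positions


data = b'a,"b,c",d\n1,"x\ny",3\n'
print(safe_split_points(data, 2))  # prints [10]
print(index_fields(data))          # prints [1, 7, 9, 11, 17, 19]
```

The comma at offset 4 and the newline at offset 14 sit inside quoted fields, so neither is indexed and neither is offered as a split point. Shrinking `block_size` gives the same result, which shows that the carry bit is enough to stitch blocks together.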

Learn more in the Architecture documentation.

Documentation

Topic	Description
Getting Started	Build instructions and basic usage
CLI Reference	Command-line tool documentation
Streaming Parser	Memory-efficient parsing for large files
Index Caching	Speed up repeated file reads
C API Reference	C bindings for FFI integration
Python Package	Python bindings with Arrow support
Architecture	Two-pass algorithm details
Error Handling	Error modes and recovery
Integration Guide	CMake integration and build options
Benchmarks	Performance comparisons

References

Chang Ge, Yinan Li, Eric Eilebrecht, Badrish Chandramouli, Donald Kossmann. Speculative Distributed CSV Data Parsing for Big Data Analytics. SIGMOD 2019.
Geoff Langdale, Daniel Lemire. Parsing Gigabytes of JSON per Second. VLDB Journal 28 (6), 2019.