Python Package

vroom-csv: High-performance CSV parsing for Python

Overview

The vroom-csv Python package provides Python bindings for the libvroom CSV parser, featuring:

  • SIMD-accelerated parsing using Google Highway
  • Multi-threaded parsing for large files
  • Automatic dialect detection (delimiter, quoting, line endings)
  • Arrow PyCapsule interface for zero-copy interoperability with PyArrow, Polars, DuckDB

Installation

pip install vroom-csv

Optional Dependencies

Install with Arrow support for zero-copy DataFrame interoperability:

pip install vroom-csv[arrow]    # PyArrow integration
pip install vroom-csv[polars]   # Polars integration
pip install vroom-csv[all]      # All optional dependencies

Quick Start

Basic Usage

import vroom_csv

# Read a CSV file
table = vroom_csv.read_csv("data.csv")
print(f"Loaded {table.num_rows} rows, {table.num_columns} columns")

# Access column names
print(table.column_names)

# Get a column by name or index
col = table.column("name")
col = table.column(0)

# Get a row by index
row = table.row(0)

Dialect Detection

# Detect CSV dialect (delimiter, quoting, etc.)
dialect = vroom_csv.detect_dialect("data.csv")
print(f"Delimiter: {dialect.delimiter!r}")
print(f"Quote char: {dialect.quote_char!r}")
print(f"Has header: {dialect.has_header}")
print(f"Line ending: {dialect.line_ending!r}")
print(f"Confidence: {dialect.confidence:.2f}")
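
For intuition, the standard library's csv.Sniffer performs a simpler version of the same detection (delimiter, quote character, header heuristic, but no confidence score) and can serve as a fallback when vroom-csv is unavailable:

```python
# Dialect sniffing with the stdlib: a rough analogue of detect_dialect.
import csv

sample = "name;value\r\nalice;10\r\nbob;20\r\n"
sniffer = csv.Sniffer()

# Restrict candidate delimiters to make detection deterministic
dialect = sniffer.sniff(sample, delimiters=";,")
print(repr(dialect.delimiter))      # ';'
print(repr(dialect.quotechar))      # '"'
print(sniffer.has_header(sample))   # heuristic header guess
```

Unlike vroom-csv's detector, csv.Sniffer gives no confidence value, so ambiguous samples can mis-detect silently.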

Progress Reporting

# Show progress for large files
table = vroom_csv.read_csv("large.csv", progress=vroom_csv.default_progress)

Arrow Interoperability

The vroom-csv package implements the Arrow PyCapsule interface for zero-copy data sharing with other Arrow-compatible libraries.
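
Consumers such as PyArrow, Polars, and DuckDB discover Arrow-capable objects by looking for the `__arrow_c_stream__` method defined by the PyCapsule interface. A minimal sketch of that handshake (DummyTable is a stand-in for illustration, not part of vroom-csv):

```python
# How Arrow consumers detect the PyCapsule stream interface.
class DummyTable:
    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation returns a PyCapsule wrapping an
        # ArrowArrayStream; raising keeps this stand-in honest.
        raise NotImplementedError("illustration only")

def supports_arrow_stream(obj):
    """Return True if obj exports the Arrow C stream interface."""
    return hasattr(obj, "__arrow_c_stream__")

print(supports_arrow_stream(DummyTable()))  # True
print(supports_arrow_stream(object()))      # False
```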

PyArrow

import pyarrow as pa
import vroom_csv

# Zero-copy conversion to PyArrow Table
table = vroom_csv.read_csv("data.csv")
arrow_table = pa.table(table)

# Export to Parquet
import pyarrow.parquet as pq
pq.write_table(arrow_table, "data.parquet")

Polars

import polars as pl
import vroom_csv

# Zero-copy conversion to Polars DataFrame
table = vroom_csv.read_csv("data.csv")
df = pl.from_arrow(table)

# Use Polars operations
result = df.select(["name", "value"]).filter(pl.col("value") > 100)

DuckDB

import duckdb
import vroom_csv

# Query CSV data with DuckDB. DuckDB's replacement scan resolves local
# variables by name, so avoid the reserved word "table" as the variable name.
tbl = vroom_csv.read_csv("data.csv")
result = duckdb.query("SELECT * FROM tbl WHERE value > 100")

Memory-Efficient Processing

Batched Reading

For files too large to fit in memory, use batched reading:

import vroom_csv

# Process in batches of 10,000 rows
for batch in vroom_csv.read_csv_batched("huge.csv", batch_size=10000):
    print(f"Processing batch with {batch.num_rows} rows")
    # Each batch supports Arrow interface
    # df = pl.from_arrow(batch)
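
The batching pattern itself is independent of the parser; a pure-Python sketch with itertools.islice shows the shape of the loop, with any row iterator standing in for the CSV reader:

```python
# Generic fixed-size batching over an arbitrary row iterator.
from itertools import islice

def batched(rows, batch_size):
    """Yield lists of up to batch_size items from an iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

rows = ({"id": i} for i in range(25))
sizes = [len(b) for b in batched(rows, 10)]
print(sizes)  # [10, 10, 5]
```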

Row-by-Row Streaming

For minimal memory usage, iterate row by row:

import vroom_csv

# Stream rows one at a time
for row in vroom_csv.read_csv_rows("huge.csv"):
    # row is a dict: {"col1": "value1", "col2": "value2", ...}
    print(row)
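
The row-dict shape mirrors the standard library's csv.DictReader, which can serve as a drop-in for small files:

```python
# Streaming rows as dicts with the stdlib, analogous to read_csv_rows.
import csv
import io

data = io.StringIO("col1,col2\na,1\nb,2\n")
rows = list(csv.DictReader(data))
for row in rows:
    print(row)  # {'col1': 'a', 'col2': '1'}, then {'col1': 'b', 'col2': '2'}
```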

Common Options

Column Selection

# Read only specific columns (by name or index)
table = vroom_csv.read_csv("data.csv", usecols=["name", "value"])
table = vroom_csv.read_csv("data.csv", usecols=[0, 2, 5])

Row Limits

# Skip header rows and limit row count
table = vroom_csv.read_csv("data.csv", skip_rows=2, n_rows=1000)

Null Value Handling

# Custom null value strings
table = vroom_csv.read_csv(
    "data.csv",
    null_values=["NA", "NULL", "N/A", ""],
    empty_is_null=True
)
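
Conceptually, null handling maps designated sentinel strings to None during parsing. A pure-Python sketch of that normalization (not the library's actual implementation):

```python
# Map sentinel strings to None, mirroring null_values / empty_is_null.
NULL_VALUES = {"NA", "NULL", "N/A", ""}

def normalize(value, null_values=NULL_VALUES, empty_is_null=True):
    """Return None for sentinel strings, the original value otherwise."""
    if value == "" and not empty_is_null:
        return value
    return None if value in null_values else value

print(normalize("NA"))   # None
print(normalize("42"))   # '42'
```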

Multi-threaded Parsing

# Use multiple threads for faster parsing
table = vroom_csv.read_csv("large.csv", num_threads=4)

Memory Mapping

# Force memory mapping for large files
table = vroom_csv.read_csv("huge.csv", memory_map=True)

# Disable memory mapping (loads entire file into memory)
table = vroom_csv.read_csv("data.csv", memory_map=False)

Error Handling

import vroom_csv

try:
    table = vroom_csv.read_csv("data.csv")
except vroom_csv.ParseError as e:
    print(f"Parse error: {e}")
except vroom_csv.IOError as e:
    print(f"I/O error: {e}")
except vroom_csv.VroomError as e:
    print(f"General error: {e}")
else:
    # Check for non-fatal parse errors on a successful read
    if table.has_errors():
        print(table.error_summary())
        for error in table.errors():
            print(f"  - {error}")

API Reference

See the Python API Reference for complete documentation of all functions and classes.

Performance

The vroom-csv package achieves high performance through:

  • SIMD acceleration: Uses AVX2/NEON instructions for parallel processing
  • Multi-threaded parsing: Divides large files across CPU cores
  • Memory mapping: Avoids copying file data when possible
  • Zero-copy Arrow: Shares data with other libraries without copying

Typical throughput on modern hardware:

File Size   Single-threaded   Multi-threaded (4 cores)
10 MB       3+ GB/s           5+ GB/s
100 MB      4+ GB/s           6+ GB/s

Source Code

The Python package source is in the python/ directory of the libvroom repository.