Benchmarks

Performance benchmarks measured with pytest-benchmark on a single machine (Apple Silicon). All times are median values across multiple iterations.

Run benchmarks yourself

uv run pytest tests/test_benchmarks.py --benchmark-enable --benchmark-only \
    -o "addopts=" --timeout=300

Serialization

Round-trip Arrow IPC serialization and deserialization for dataclasses and raw RecordBatch objects.

| Benchmark | Median | Description |
| --- | --- | --- |
| Serialize primitive dataclass | 39 us | 4 fields: str, int, float, bool |
| Deserialize primitive dataclass | 69 us | Same 4-field dataclass |
| Serialize complex dataclass | 101 us | Enum, dict, frozenset, list fields |
| Deserialize complex dataclass | 113 us | Same complex dataclass |
| Serialize 10K-row batch | 18 us | int64 + float64 + utf8 columns |
| Deserialize 10K-row batch | 43 us | Same 10K-row batch |

Raw RecordBatch serialization is faster than dataclass serialization even for much larger payloads: a 10K-row batch serializes in 18 us, versus 39 us for a single 4-field dataclass. Dataclasses require Python-level field packing/unpacking on top of the Arrow IPC encoding.

End-to-End RPC Calls

Full round-trip latency for unary and streaming calls across all transports. Each measurement includes serialization, transport overhead, dispatch, and deserialization.

Unary Methods

| Method | Pipe | Subprocess | Unix | Unix (threaded) | Shared Memory | Pool | HTTP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| noop() | 0.11 ms | 0.07 ms | 0.07 ms | 0.07 ms | 0.11 ms | 0.07 ms | 0.50 ms |
| add(a, b) | 0.17 ms | 0.09 ms | 0.10 ms | 0.10 ms | 0.44 ms | 0.09 ms | 0.57 ms |
| greet(name) | 0.15 ms | 0.08 ms | 0.10 ms | 0.09 ms | 0.40 ms | 0.08 ms | 0.52 ms |
| roundtrip_types(...) | 0.25 ms | 0.14 ms | 0.15 ms | 0.16 ms | 0.51 ms | 0.15 ms | 0.61 ms |
  • Subprocess, Unix, and pool transports have the lowest latency (~0.07-0.16 ms)
  • Pipe is slightly higher due to thread coordination overhead
  • Shared memory carries more setup overhead for simple calls but shines with large batches
  • HTTP has a baseline of ~0.5 ms even for noop(), which comes from the Falcon/httpx stack (in-process WSGI, no network)

Streaming

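| Method | Pipe | Subprocess | Unix | Unix (threaded) | Shared Memory | Pool | HTTP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Producer (50 batches) | 3.5 ms | 2.3 ms | 3.7 ms | 4.1 ms | 7.3 ms | 2.3 ms | 9.2 ms |
| Exchange (20 rounds) | 3.3 ms | 1.3 ms | 1.7 ms | 2.0 ms | 6.9 ms | 1.4 ms | 24 ms |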
  • Producer streams generate 50 batches; exchange streams perform 20 bidirectional exchanges
  • HTTP exchange is the slowest because each round-trip goes through the full WSGI stack with stream state token serialization
  • Subprocess and pool transports benefit from OS-level pipe buffering for streaming workloads
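
To compare transports per item rather than per run, the streaming medians can be divided by the batch or round count. The sketch below does that arithmetic on the figures from the table above; the function and variable names are illustrative and not part of the benchmark suite. The results are averages that still include any per-stream setup cost, so they are an upper estimate of the cost of one batch or round.

```python
# Streaming medians in milliseconds, copied from the table above.
PRODUCER_MS = {
    "Pipe": 3.5, "Subprocess": 2.3, "Unix": 3.7, "Unix (threaded)": 4.1,
    "Shared Memory": 7.3, "Pool": 2.3, "HTTP": 9.2,
}
EXCHANGE_MS = {
    "Pipe": 3.3, "Subprocess": 1.3, "Unix": 1.7, "Unix (threaded)": 2.0,
    "Shared Memory": 6.9, "Pool": 1.4, "HTTP": 24.0,
}

PRODUCER_BATCHES = 50  # batches per producer stream
EXCHANGE_ROUNDS = 20   # bidirectional exchanges per exchange stream


def per_item_us(total_ms, count):
    """Average cost of one batch or round, in microseconds."""
    return total_ms * 1000.0 / count


def per_item_table():
    """Map transport -> (us per produced batch, us per exchange round)."""
    return {
        transport: (
            round(per_item_us(PRODUCER_MS[transport], PRODUCER_BATCHES)),
            round(per_item_us(EXCHANGE_MS[transport], EXCHANGE_ROUNDS)),
        )
        for transport in PRODUCER_MS
    }


if __name__ == "__main__":
    for transport, (batch_us, round_us) in per_item_table().items():
        print(f"{transport:<16} {batch_us:>5} us/batch {round_us:>6} us/round")
```

On these numbers a subprocess exchange round averages about 65 us, while an HTTP exchange round averages about 1200 us, which is roughly twice the HTTP unary baseline.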

Schema Generation

| Benchmark | Median | Description |
| --- | --- | --- |
| Schema generation (uncached) | 51 us | Full _generate_schema from type hints |
| Schema generation (cached) | 0.3 us | Cached descriptor access |

Schema generation is a one-time cost per dataclass. After the first access, ARROW_SCHEMA is cached on the class via a descriptor, so subsequent accesses are ~170x faster (0.3 us versus 51 us).
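
The caching pattern can be illustrated with a small non-data descriptor. This is a minimal sketch of the general technique, not the library's implementation: `CachedSchema`, `ArrowSerializable`, and `build_schema` are hypothetical names, and `build_schema` returns plain (name, type) pairs as a stand-in for a real Arrow schema.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


def build_schema(cls):
    """Hypothetical stand-in for schema generation: (field name, type) pairs."""
    hints = get_type_hints(cls)
    return tuple((f.name, hints[f.name]) for f in fields(cls))


class CachedSchema:
    """Descriptor that builds the schema on first access and caches it per class."""

    def __set_name__(self, owner, name):
        # Name of the class attribute that will hold the cached value.
        self._cache_name = f"_{name}_cache"

    def __get__(self, instance, owner):
        # Check the class's own __dict__ so each subclass caches its own schema.
        cached = owner.__dict__.get(self._cache_name)
        if cached is None:
            cached = build_schema(owner)  # slow path, runs once per class
            setattr(owner, self._cache_name, cached)
        return cached


class ArrowSerializable:
    """Hypothetical base class exposing the cached schema attribute."""

    ARROW_SCHEMA = CachedSchema()


@dataclass
class Point(ArrowSerializable):
    x: int
    y: float


if __name__ == "__main__":
    first = Point.ARROW_SCHEMA   # computed from type hints
    second = Point.ARROW_SCHEMA  # served from the per-class cache
    print(first, first is second)
```

Storing the cache under the accessing class, rather than on the descriptor instance, keeps a subclass with extra fields from inheriting its parent's schema.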

Memory Bounds

| Benchmark | Median | Peak Memory | Description |
| --- | --- | --- | --- |
| Serialize 100K-row batch | 0.33 ms | < 50 MB | int64 + float64 + utf8 columns |
| Deserialize 100K-row batch | 0.40 ms | < 50 MB | Same 100K-row batch |

Peak memory stays below 50 MB for the 100K-row batches. The benchmark suite asserts this 50 MB bound, so regressions are caught automatically.
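
One way to write such a bound check is with the standard library's tracemalloc module. The source does not say how the suite measures peak memory, so this is only a sketch of the idea under that assumption; `peak_memory_bytes` and `build_rows` are hypothetical names, and the workload is a plain Python list rather than an Arrow batch. Note that tracemalloc only tracks allocations made through Python's allocator, so buffers allocated by native libraries outside it are not counted.

```python
import tracemalloc

LIMIT_BYTES = 50 * 1024 * 1024  # 50 MB bound, matching the figure quoted above


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        tracemalloc.reset_peak()  # available since Python 3.9
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_rows(n):
    """Hypothetical workload: n rows of (int, float, str) values."""
    return [(i, i * 0.5, f"row-{i}") for i in range(n)]


if __name__ == "__main__":
    rows, peak = peak_memory_bytes(build_rows, 100_000)
    # Fail loudly if the workload exceeds the bound.
    assert peak < LIMIT_BYTES, f"peak {peak} bytes exceeds {LIMIT_BYTES}"
    print(f"{len(rows)} rows, peak {peak / 1e6:.1f} MB")
```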

Methodology

  • Timer: time.perf_counter (wall clock)
  • Values reported: Median (robust against outliers from GC pauses, OS scheduling)
  • Calibration: pytest-benchmark auto-calibrates iteration count for statistical significance
  • Environment: Apple Silicon (M-series), Python 3.13. Results vary by hardware; run locally for your own baseline
  • Test fixture: RPC benchmarks use the make_conn fixture parametrized over pipe, subprocess, Unix, Unix threaded, shared memory, pool, and HTTP transports
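
For a quick sanity check outside pytest-benchmark, the same approach (wall-clock timer, median of many samples) can be sketched with the standard library. This is not how pytest-benchmark is implemented, and it uses fixed round counts rather than auto-calibration; `median_seconds` and its parameters are illustrative names.

```python
import statistics
import time


def median_seconds(func, *, rounds=200, warmup=10):
    """Time func() over several rounds and return the median wall-clock duration."""
    for _ in range(warmup):
        func()  # warm caches before measuring
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    elapsed = median_seconds(lambda: sorted(range(1000), reverse=True))
    print(f"median: {elapsed * 1e6:.1f} us")

    # Why the median: one slow sample (say, a GC pause) barely moves it.
    samples_us = [39, 40, 41, 38, 5000]
    print(statistics.median(samples_us), statistics.mean(samples_us))
```

In the last two lines, the single 5000 us outlier leaves the median at 40 us but pulls the mean above 1000 us, which is why the tables report medians.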