External Storage¶
When RPC payloads grow large — multi-megabyte query results, bulk data imports, large model artifacts — they can exceed the capacity of the underlying transport. HTTP servers and reverse proxies typically enforce request and response body size limits (e.g., 10 MB). Even pipe-based transports become inefficient when serializing very large batches inline.
External storage solves this by transparently offloading oversized Arrow IPC batches to cloud storage (S3, GCS, or any compatible backend). The batch is replaced with a lightweight pointer batch — a zero-row batch carrying a download URL in its metadata. The receiving side resolves the pointer automatically, fetching the actual data in parallel chunks.
This works in both directions:
- Large outputs — the server externalizes result batches that exceed a size threshold
- Large inputs — the client uploads input data to a pre-signed URL and sends a pointer in the RPC request
From the caller's perspective, nothing changes. The proxy returns the same typed results; the externalization is invisible.
How It Works¶
Large Outputs (Server to Client)¶
When a server method returns a batch that exceeds `externalize_threshold_bytes`, the framework intercepts the response before writing it to the wire:
- The batch is serialized to Arrow IPC format
- Optionally compressed with zstd
- Uploaded to cloud storage via the configured `ExternalStorage` backend
- The storage backend returns a download URL (typically a pre-signed URL with an expiry)
- A pointer batch replaces the original — a zero-row batch with `vgi_rpc.location` metadata containing the URL
- The pointer batch is written to the wire instead of the full data
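To make the pointer format concrete, here is a minimal sketch of what such a batch could look like in pyarrow, assuming only what is described above (a zero-row batch whose schema metadata carries the URL under `vgi_rpc.location`); the URL and column names are illustrative:

```python
import pyarrow as pa

# Illustrative only: a zero-row batch whose schema metadata carries the
# download URL under the "vgi_rpc.location" key described above.
schema = pa.schema(
    [pa.field("id", pa.int64()), pa.field("value", pa.float64())],
    metadata={"vgi_rpc.location": "https://storage.example.com/rpc-data/abc123"},
)
pointer = pa.RecordBatch.from_arrays(
    [pa.array([], type=f.type) for f in schema],
    schema=schema,
)
assert pointer.num_rows == 0  # carries no data, only the pointer metadata
```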
On the client side, pointer batches are detected and resolved transparently:
- The client reads the pointer batch and extracts the URL from `vgi_rpc.location` metadata
- A HEAD probe determines the content size and whether the server supports range requests
- For large payloads (>64 MiB by default), parallel range-request fetching splits the download into chunks with speculative hedging for slow chunks
- For smaller payloads, a single GET request fetches the data
- If the data was compressed, it is decompressed automatically
- The Arrow IPC stream is parsed back into a `RecordBatch` and returned to the caller
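A hedged sketch of the detection step, using only the convention described above (zero rows plus the `vgi_rpc.location` metadata key):

```python
import pyarrow as pa

def is_pointer_batch(batch: pa.RecordBatch) -> bool:
    # A pointer batch carries no rows; the URL lives in schema metadata.
    # pyarrow exposes schema metadata keys as bytes.
    meta = batch.schema.metadata or {}
    return batch.num_rows == 0 and b"vgi_rpc.location" in meta
```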
Resolution works for both unary results and streaming outputs. For streaming, all batches in a single output cycle (including any log batches) are serialized as one IPC stream, uploaded together, and resolved as a unit.
Large Inputs (Client to Server)¶
For the reverse direction — when a client needs to send large data to the server — the framework provides upload URLs. This is particularly important for HTTP transport where request body size limits are common.
- The client calls `request_upload_urls()` to get one or more pre-signed URL pairs from the server
- Each `UploadUrl` contains an `upload_url` (for PUT) and a `download_url` (for GET) — these may be the same or different depending on the storage backend
- The client serializes the large batch to Arrow IPC and uploads it directly to the `upload_url` via HTTP PUT, bypassing the RPC server entirely
- The client sends the RPC request with a pointer batch referencing the `download_url`
- The server resolves the pointer batch transparently, fetching the data from storage
This pattern lets clients send arbitrarily large payloads even when the HTTP server enforces strict request size limits — the large data goes directly to cloud storage, and only a small pointer crosses the RPC boundary.
The upload URL endpoint is enabled by passing `upload_url_provider` to `make_wsgi_app()`. The server advertises this capability via the `VGI-Upload-URL-Support: true` response header, and clients can discover it with `http_capabilities()`.
Server Configuration¶
```python
from vgi_rpc import Compression, ExternalLocationConfig, FetchConfig, RpcServer, S3Storage

storage = S3Storage(bucket="my-bucket", prefix="rpc-data/")

config = ExternalLocationConfig(
    storage=storage,
    externalize_threshold_bytes=1_048_576,  # 1 MiB — batches above this go to S3
    compression=Compression(),  # zstd level 3 by default
    fetch_config=FetchConfig(
        chunk_size_bytes=8 * 1024 * 1024,  # 8 MiB parallel chunks
        max_parallel_requests=8,
    ),
)

server = RpcServer(MyService, MyServiceImpl(), external_location=config)
```
The same `ExternalLocationConfig` is passed to the client so it can resolve pointer batches in responses:
```python
from vgi_rpc import connect

with connect(MyService, ["python", "worker.py"], external_location=config) as proxy:
    result = proxy.large_query()  # pointer batches resolved transparently
```
For HTTP transport with upload URL support:
```python
from vgi_rpc import make_wsgi_app

app = make_wsgi_app(
    server,
    upload_url_provider=storage,  # enables __upload_url__ endpoint
    max_upload_bytes=100 * 1024 * 1024,  # advertise 100 MiB max upload
)
```
Client Upload URLs¶
```python
import httpx

from vgi_rpc import UploadUrl, request_upload_urls

# Get pre-signed upload URLs from the server
urls: list[UploadUrl] = request_upload_urls(
    base_url="http://localhost:8080",
    count=1,
)

# Upload large data directly to storage (bypasses RPC server)
ipc_data = serialize_large_batch(my_batch)
httpx.put(urls[0].upload_url, content=ipc_data)

# Send the RPC request with a pointer to the uploaded data;
# the server resolves the pointer transparently
```
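The final comment gestures at a step the snippet leaves out. As a hedged illustration of that step only, assuming input pointer batches use the same `vgi_rpc.location` metadata convention as externalized outputs (and with `bulk_import` as a made-up method name):

```python
import pyarrow as pa

# Hypothetical sketch: wrap the download URL in a zero-row pointer batch
# and pass it where the large input batch would normally go.
pointer = pa.RecordBatch.from_arrays(
    [pa.array([], type=f.type) for f in my_batch.schema],
    schema=my_batch.schema.with_metadata(
        {"vgi_rpc.location": urls[0].download_url}
    ),
)
result = proxy.bulk_import(pointer)  # bulk_import is a made-up method name
```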
Clients can discover whether the server supports upload URLs:
```python
from vgi_rpc import http_capabilities

caps = http_capabilities("http://localhost:8080")
if caps.upload_url_support:
    urls = request_upload_urls("http://localhost:8080", count=5)
```
Compression¶
When compression is enabled, batches are compressed with zstd before upload and decompressed transparently on fetch:
```python
from vgi_rpc import Compression

# Default: zstd level 3
config = ExternalLocationConfig(
    storage=storage,
    compression=Compression(),  # algorithm="zstd", level=3
)

# Custom compression level (higher = smaller but slower)
config = ExternalLocationConfig(
    storage=storage,
    compression=Compression(level=9),
)

# Disable compression
config = ExternalLocationConfig(
    storage=storage,
    compression=None,
)
```
The storage backend stores the `Content-Encoding` header alongside the object. When the client fetches the data, the `Content-Encoding: zstd` header triggers automatic decompression.

Requires `pip install vgi-rpc[external]` (installs `zstandard`).
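A minimal sketch of that client-side decision, assuming frames produced by a standard `zstandard` compressor (an illustration, not the library's code):

```python
import zstandard

def maybe_decompress(body: bytes, content_encoding: str | None) -> bytes:
    # Decompress only when the object was stored with Content-Encoding: zstd.
    # Assumes the zstd frame embeds its content size, which is the default
    # for zstandard.ZstdCompressor().compress().
    if content_encoding == "zstd":
        return zstandard.ZstdDecompressor().decompress(body)
    return body
```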
Parallel Fetching¶
For large externalized batches, the client fetches data in parallel chunks using range requests. This significantly reduces download time for payloads in the tens of megabytes and beyond:
- A HEAD probe determines the total size and whether the server supports `Accept-Ranges: bytes`
- If the payload exceeds `parallel_threshold_bytes` (default 64 MiB) and ranges are supported, it is split into `chunk_size_bytes` chunks (default 8 MiB) fetched concurrently
- Speculative hedging: if a chunk takes longer than `speculative_retry_multiplier` times the median chunk time, a hedge request is launched in parallel — the first response wins
- For smaller payloads or servers without range support, a single GET request is used
```python
from vgi_rpc import FetchConfig

fetch_config = FetchConfig(
    parallel_threshold_bytes=64 * 1024 * 1024,  # 64 MiB
    chunk_size_bytes=8 * 1024 * 1024,  # 8 MiB chunks
    max_parallel_requests=8,  # concurrent fetches
    timeout_seconds=60.0,  # overall deadline
    max_fetch_bytes=256 * 1024 * 1024,  # 256 MiB hard cap
    speculative_retry_multiplier=2.0,  # hedge at 2x median
    max_speculative_hedges=4,  # max hedge requests
)
```
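To make the chunking and hedging behavior concrete, here is a toy synchronous sketch of the idea; the library itself uses a pooled `aiohttp` session, and everything below, including the helper names, is illustrative:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

import httpx

def fetch_range(client: httpx.Client, url: str, start: int, end: int) -> bytes:
    # One chunk via an HTTP Range request (byte positions are inclusive).
    resp = client.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return resp.content

def fetch_chunk_hedged(
    client: httpx.Client,
    url: str,
    start: int,
    end: int,
    median_seconds: float,
    multiplier: float = 2.0,
) -> bytes:
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(fetch_range, client, url, start, end)
        done, _ = wait([primary], timeout=multiplier * median_seconds)
        if done:
            return primary.result()
        # Primary is slow: race a duplicate request and take the first
        # finisher. (Exiting the pool still waits for the loser; a real
        # implementation would cancel it.)
        hedge = pool.submit(fetch_range, client, url, start, end)
        done, _ = wait([primary, hedge], return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
```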
Object Lifecycle¶
Uploaded objects persist indefinitely — vgi-rpc does not delete them. Configure storage-level cleanup policies to auto-expire old data.
S3 lifecycle rule:
```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket MY_BUCKET \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-vgi-rpc",
      "Filter": {"Prefix": "rpc-data/"},
      "Status": "Enabled",
      "Expiration": {"Days": 1}
    }]
  }'
```
GCS lifecycle rule:
```bash
gsutil lifecycle set <(cat <<EOF
{"rule": [{"action": {"type": "Delete"},
           "condition": {"age": 1, "matchesPrefix": ["rpc-data/"]}}]}
EOF
) gs://MY_BUCKET
```
URL Validation¶
By default, external location URLs are validated with `https_only_validator`, which rejects non-HTTPS URLs. This prevents pointer batches from being used to probe internal networks. You can provide a custom validator via `ExternalLocationConfig.url_validator`.
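For example, a stricter validator could pin fetches to a known host. The sketch below uses only the documented contract (the callback raises to reject a URL); the allowlisted hostname is made up:

```python
from urllib.parse import urlparse

from vgi_rpc import ExternalLocationConfig

def allowlist_validator(url: str) -> None:
    # Reject anything that is not HTTPS on an expected storage host.
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError(f"insecure URL scheme: {parsed.scheme!r}")
    if parsed.hostname != "my-bucket.s3.amazonaws.com":  # hypothetical host
        raise ValueError(f"unexpected host: {parsed.hostname!r}")

config = ExternalLocationConfig(storage=storage, url_validator=allowlist_validator)
```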
API Reference¶
Configuration¶
ExternalLocationConfig (dataclass)¶
```python
ExternalLocationConfig(
    storage: ExternalStorage | None = None,
    externalize_threshold_bytes: int = 1048576,
    max_retries: int = 2,
    retry_delay_seconds: float = 0.5,
    fetch_config: FetchConfig = FetchConfig(),
    compression: Compression | None = None,
    url_validator: Callable[[str], None] | None = https_only_validator,
)
```
Configuration for ExternalLocation batch support.
**Note — Trust boundary:** When resolution is enabled, the client will fetch URLs embedded in server responses. The default `url_validator` (`https_only_validator`) rejects non-HTTPS URLs, but callers should still only connect to trusted RPC servers.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `storage` | `ExternalStorage \| None` | Storage backend for uploading externalized data. Required for production (writing); not needed for resolution-only (reading) scenarios. |
| `externalize_threshold_bytes` | `int` | Data batch buffer size above which to externalize. Default 1048576 (1 MiB). |
| `max_retries` | `int` | Number of fetch retries (total attempts = `max_retries` + 1). |
| `retry_delay_seconds` | `float` | Delay between retry attempts. |
| `fetch_config` | `FetchConfig` | Fetch configuration controlling parallelism, timeouts, and size limits. Defaults to sensible values. |
| `compression` | `Compression \| None` | Compression settings for externalized data. `None` disables compression. |
| `url_validator` | `Callable[[str], None] \| None` | Callback invoked before fetching external URLs. Should raise (e.g. `ValueError`) to reject a URL. Defaults to `https_only_validator`. |
Compression (dataclass)¶
Compression settings for externalized data.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `algorithm` | `str` | Compression algorithm. Currently only `"zstd"` is supported. |
| `level` | `int` | Compression level (1-22 for zstd). Default 3. |
FetchConfig (dataclass)¶
```python
FetchConfig(
    parallel_threshold_bytes: int = 64 * 1024 * 1024,
    chunk_size_bytes: int = 8 * 1024 * 1024,
    max_parallel_requests: int = 8,
    timeout_seconds: float = 60.0,
    max_fetch_bytes: int = 256 * 1024 * 1024,
    speculative_retry_multiplier: float = 2.0,
    max_speculative_hedges: int = 4,
)
```
Configuration for parallel range-request fetching.

Maintains a persistent `aiohttp.ClientSession` backed by a daemon thread. Use as a context manager or call `close()` to release resources.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `parallel_threshold_bytes` | `int` | Below this size, use a single GET. |
| `chunk_size_bytes` | `int` | Size of each Range request chunk. |
| `max_parallel_requests` | `int` | Semaphore limit on concurrent requests. |
| `timeout_seconds` | `float` | Overall deadline for the fetch. |
| `max_fetch_bytes` | `int` | Hard cap on total download size. |
| `speculative_retry_multiplier` | `float` | Launch a hedge request for chunks taking longer than this multiple of the median chunk time. |
| `max_speculative_hedges` | `int` | Maximum number of hedge requests per fetch. Prevents runaway request amplification under adversarial or slow-backend conditions. |
close¶
Close the pooled session, stop the event loop, and join the thread.
`__enter__`¶
```python
__enter__() -> FetchConfig
```
`__exit__`¶
```python
__exit__(
    exc_type: type[BaseException] | None,
    exc_val: BaseException | None,
    exc_tb: TracebackType | None,
) -> None
```
Exit the context manager, closing the pool.
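Putting the lifecycle together, a brief usage sketch (assumes a `storage` backend constructed as in the earlier examples):

```python
from vgi_rpc import ExternalLocationConfig, FetchConfig

# Scope the pooled session with a context manager; it is closed on exit.
with FetchConfig(max_parallel_requests=4) as fetch_config:
    config = ExternalLocationConfig(storage=storage, fetch_config=fetch_config)
    ...  # pass config to connect() or RpcServer as shown earlier
```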
Storage Protocol¶
ExternalStorage¶
Bases: Protocol
Pluggable storage interface for externalizing large batches. Implementations must be thread-safe — `upload()` may be called concurrently from different server threads.
**Important — Object lifecycle:** vgi-rpc does not manage the lifecycle of uploaded objects. Data uploaded via `upload()` or `generate_upload_url()` persists indefinitely unless the server operator configures cleanup. Use storage-level lifecycle rules (e.g. S3 Lifecycle Policies, GCS Object Lifecycle Management) or bucket-level TTLs to automatically expire and delete stale objects.
upload¶
Upload serialized IPC data and return a URL for retrieval.
The uploaded object is not automatically deleted — server operators are responsible for configuring object cleanup via storage lifecycle rules or TTLs.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `bytes` | Complete Arrow IPC stream bytes. |
| `schema` | `Schema` | The schema of the data being uploaded. |
| `content_encoding` | `str \| None` | Optional encoding applied to `data` (e.g. `"zstd"`). |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | A URL (typically pre-signed) that can be fetched to retrieve the uploaded data. |
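Because `ExternalStorage` is a `Protocol`, any object with a conforming `upload()` can serve as a backend. A toy sketch for local testing follows; everything here is illustrative, and note that the default `https_only_validator` would reject the `file://` URLs it returns, so it needs a permissive `url_validator`:

```python
import uuid
from pathlib import Path

import pyarrow as pa

class LocalDirStorage:
    """Toy ExternalStorage: writes IPC bytes to a local directory."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def upload(
        self, data: bytes, schema: pa.Schema, content_encoding: str | None = None
    ) -> str:
        # content_encoding is ignored here; a real backend would record it
        # so the client knows whether to decompress on fetch.
        path = self.root / f"{uuid.uuid4().hex}.arrows"
        path.write_bytes(data)
        return path.as_uri()  # file:// URL; needs a permissive url_validator
```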
UploadUrlProvider¶
Bases: Protocol
Generates pre-signed upload URL pairs. Implementations must be thread-safe — `generate_upload_url()` may be called concurrently from different server threads.

**Important — Object lifecycle:** vgi-rpc does not manage the lifecycle of uploaded objects. Data uploaded via these URLs persists indefinitely unless the server operator configures cleanup. Use storage-level lifecycle rules (e.g. S3 Lifecycle Policies, GCS Object Lifecycle Management) or bucket-level TTLs to automatically expire and delete stale objects.
generate_upload_url¶
```python
generate_upload_url(schema: Schema) -> UploadUrl
```
Generate a pre-signed upload/download URL pair.
The caller receives time-limited PUT and GET URLs for a new storage object. The uploaded object is not automatically deleted — server operators are responsible for configuring object cleanup via storage lifecycle rules or TTLs.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `schema` | `Schema` | The Arrow schema of the data to be uploaded. Backends may use this for content-type or metadata hints. |

| RETURNS | DESCRIPTION |
|---|---|
| `UploadUrl` | An `UploadUrl` with pre-signed PUT and GET URLs for the same new storage object. |
UploadUrl (dataclass)¶
Pre-signed URL pair for client-side data upload.
S3/GCS pre-signed URLs are signed per HTTP method, so a PUT URL cannot be used for GET — hence two URLs for the same storage object.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `upload_url` | `str` | Pre-signed PUT URL for uploading data. |
| `download_url` | `str` | Pre-signed GET URL for downloading the uploaded data. |
| `expires_at` | `datetime` | Expiration time for the pre-signed URLs (UTC). |
Validation¶
https_only_validator¶
Reject URLs that do not use the https scheme.

This is the default `url_validator` for `ExternalLocationConfig`. It prevents the client from issuing requests over plain HTTP or other schemes (ftp, file, etc.) when resolving external-location pointers.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to validate. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If the URL scheme is not `https`. |
S3 Backend¶
Requires `pip install vgi-rpc[s3]`.
S3Storage (dataclass)¶
```python
S3Storage(
    bucket: str,
    prefix: str = "vgi-rpc/",
    presign_expiry_seconds: int = 3600,
    region_name: str | None = None,
    endpoint_url: str | None = None,
)
```
S3-backed ExternalStorage using boto3.
**Important — Object lifecycle:** uploaded objects persist indefinitely. Configure an [S3 Lifecycle Policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on the bucket to expire objects under `prefix` (default `vgi-rpc/`) after a suitable retention period. See `vgi_rpc.external` for full details and examples.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `bucket` | `str` | S3 bucket name. |
| `prefix` | `str` | Key prefix for uploaded objects. Default `"vgi-rpc/"`. |
| `presign_expiry_seconds` | `int` | Lifetime of pre-signed GET URLs. Default 3600. |
| `region_name` | `str \| None` | AWS region (optional). |
| `endpoint_url` | `str \| None` | Custom endpoint for S3-compatible services (e.g. MinIO, LocalStack). |
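For example, pointing the backend at an S3-compatible service for local development (bucket name and endpoint are illustrative; credentials come from the usual boto3 sources):

```python
from vgi_rpc import S3Storage

# MinIO or LocalStack work via endpoint_url; credentials are resolved by
# boto3 (environment variables, shared config, or instance role).
storage = S3Storage(
    bucket="dev-bucket",
    prefix="rpc-data/",
    endpoint_url="http://localhost:9000",
    region_name="us-east-1",
)
```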
upload¶
Upload IPC data to S3 and return a pre-signed GET URL.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `bytes` | Serialized Arrow IPC stream bytes. |
| `schema` | `Schema` | Schema of the data (unused but required by the protocol). |
| `content_encoding` | `str \| None` | Optional encoding applied to `data` (e.g. `"zstd"`). |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | A pre-signed URL that can be used to download the data. |
generate_upload_url¶
```python
generate_upload_url(schema: Schema) -> UploadUrl
```
Generate pre-signed PUT and GET URLs for client-side upload.
The created S3 object is not automatically deleted. Configure S3 Lifecycle Policies on the bucket to expire objects after a suitable retention period.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `schema` | `Schema` | The Arrow schema of the data to be uploaded (unused but available for metadata hints). |

| RETURNS | DESCRIPTION |
|---|---|
| `UploadUrl` | An `UploadUrl` with pre-signed PUT and GET URLs for the same S3 object. |
GCS Backend¶
Requires `pip install vgi-rpc[gcs]`.
GCSStorage (dataclass)¶
```python
GCSStorage(
    bucket: str,
    prefix: str = "vgi-rpc/",
    presign_expiry_seconds: int = 3600,
    project: str | None = None,
)
```
GCS-backed ExternalStorage using google-cloud-storage.
**Important — Object lifecycle:** uploaded objects persist indefinitely. Configure [Object Lifecycle Management](https://cloud.google.com/storage/docs/lifecycle) on the bucket to delete objects under `prefix` (default `vgi-rpc/`) after a suitable retention period. See `vgi_rpc.external` for full details and examples.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `bucket` | `str` | GCS bucket name. |
| `prefix` | `str` | Key prefix for uploaded objects. Default `"vgi-rpc/"`. |
| `presign_expiry_seconds` | `int` | Lifetime of signed GET URLs. Default 3600. |
| `project` | `str \| None` | GCS project ID (optional). |
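A brief usage sketch (bucket and project names are illustrative; credentials come from Application Default Credentials):

```python
from vgi_rpc import GCSStorage

storage = GCSStorage(
    bucket="my-gcs-bucket",
    prefix="rpc-data/",
    presign_expiry_seconds=1800,  # signed URLs valid for 30 minutes
    project="my-project",
)
```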
upload¶
Upload IPC data to GCS and return a signed GET URL.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data` | `bytes` | Serialized Arrow IPC stream bytes. |
| `schema` | `Schema` | Schema of the data (unused but required by the protocol). |
| `content_encoding` | `str \| None` | Optional encoding applied to `data` (e.g. `"zstd"`). |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | A signed URL that can be used to download the data. |
generate_upload_url¶
```python
generate_upload_url(schema: Schema) -> UploadUrl
```
Generate signed PUT and GET URLs for client-side upload.
The created GCS object is not automatically deleted. Configure GCS Object Lifecycle Management on the bucket to expire objects after a suitable retention period.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `schema` | `Schema` | The Arrow schema of the data to be uploaded (unused but available for metadata hints). |

| RETURNS | DESCRIPTION |
|---|---|
| `UploadUrl` | An `UploadUrl` with signed PUT and GET URLs for the same GCS object. |