# Dataset Storage and R2 Utilities

This document describes the dataset management workflow and the S3-compatible R2 utilities added to the S6 CLI. It covers the `s6 dataset` master command, the `s6 r2` helper commands, and the underlying `R2Client` with API and behavior details.

## Concepts

- Dataset directory: a folder that contains a `data.jsonl` file. Examples live under `./temp`.
- Local root: by default all dataset operations target `./temp`.
- Remote layout: datasets are stored under the R2 prefix `datasets/<name>/` by default (configurable). Keys preserve directory structure on upload/download.

## Motivation

Research and development loops produce sizable datasets (raw frames, previews, JSONL annotations, logs) that must be shared across local developer laptops, remote teammates, and lab machines. We wanted a single, predictable interface to move datasets without coupling the rest of the codebase to any vendor SDK or filesystem mount.

Cloudflare R2 is S3‑compatible and economical (no egress fees), so we target a generic S3 API and keep the storage endpoint configurable. That gives us:

- A uniform path model (keys, prefixes) that maps cleanly to dataset folders.
- Credentials via environment variables to avoid hard‑coding secrets.
- Swappability (R2 today, MinIO/AWS tomorrow) by only changing the endpoint.
- Clear safety semantics (no accidental overwrites) for collaborative work.

In short, the goal is to make “record → upload → iterate → download” as simple and reliable as copying a folder locally, but robust across machines.

## Credentials and Defaults

R2 is S3-compatible and uses standard S3 credentials. The client resolves credentials/region from environment variables when not explicitly provided.

- Access key: `R2_ACCESS_KEY_ID` or `AWS_ACCESS_KEY_ID`
- Secret key: `R2_SECRET_ACCESS_KEY` or `AWS_SECRET_ACCESS_KEY`
- Region: `R2_REGION_NAME` or `AWS_REGION` or `AWS_DEFAULT_REGION` (optional)
- Bucket default: `assets` (override with `-b/--bucket` or `R2_BUCKET`)
- Endpoint default: `https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com` (override with `-e/--endpoint` or `R2_ENDPOINT`/`R2_ENDPOINT_URL`)

All CLI commands accept flags to override bucket/endpoint/region and also honor the environment variables listed above.

## Setup

Follow these steps to obtain and configure R2 credentials with the standard environment variables we use (`R2_ACCESS_KEY_ID` and `R2_SECRET_ACCESS_KEY`).

1) Create an Access Key in Cloudflare R2

- Sign in to the Cloudflare dashboard.
- Navigate to R2 → S3 API → Create Access Key (or Manage R2 → Create access key).
- Choose permissions (typically Object Read/Write) and optionally restrict to specific buckets.
- Copy the generated Access Key ID and Secret Access Key. You will not be able to view the secret again after closing the dialog.

2) Find your Account ID and Endpoint URL

- In the R2 S3 API section, note your Account ID. The endpoint URL is: `https://<ACCOUNT_ID>.r2.cloudflarestorage.com`
- Example used in this repo’s defaults: `https://1195172285921be7f47e85de5cc4a5ad.r2.cloudflarestorage.com`

3) Export credentials and endpoint in your shell

```bash
# Required credentials
export R2_ACCESS_KEY_ID="<your-access-key-id>"
export R2_SECRET_ACCESS_KEY="<your-secret-access-key>"

# Optional: set endpoint and bucket (overrides CLI defaults)
export R2_ENDPOINT="https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
export R2_BUCKET="assets"

# Optional: set region if needed; R2 typically works with "auto"
export R2_REGION_NAME="auto"
```

To make these persistent, add the exports to your shell profile (e.g., `~/.zshrc` or `~/.bashrc`) and reload your shell.
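Before validating via the CLI, note that the same variables are honored when you use the library directly. A rough sketch, assuming `R2_ENDPOINT` is exported as above and using the `R2Client` API documented later in this file:

```python
import os

from s6.utils.r2 import R2Client

# Credentials and region are resolved from R2_*/AWS_* environment variables;
# bucket and endpoint are passed explicitly here for clarity.
client = R2Client(
    bucket_name=os.environ.get("R2_BUCKET", "assets"),
    endpoint_url=os.environ["R2_ENDPOINT"],
)

# One-level listing under the dataset prefix: returns (objects, child prefixes).
objects, prefixes = client.list("datasets/", recursive=False)
print(prefixes)
```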
4) Validate the setup

```bash
# List at root (uses defaults if set); add -b/-e to override
s6 r2 list -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"

# Or list datasets one level deep
s6 r2 list datasets/ --flat -b "${R2_BUCKET:-assets}" -e "${R2_ENDPOINT}"
```

If these commands return object keys or prefixes (instead of auth errors), your credentials and endpoint are configured correctly.

## CLI: `s6 r2` Utilities

Location: `src/s6/app/r2`

Shared flags (on all subcommands):

- `-b, --bucket`: bucket name (default `assets`)
- `-e, --endpoint`: endpoint URL (defaults to the endpoint above)
- `--region`: explicit region override (optional)

### List

Command: `s6 r2 list [prefix] [--flat]`

- Lists objects under an optional prefix. By default, lists recursively.
- `--flat` performs a one-level list and also prints child prefixes.

Examples:

- `s6 r2 list`
- `s6 r2 list datasets/ --flat`
- `s6 r2 list datasets/my_set/`

### Download

Command: `s6 r2 download <key-or-prefix> [-o DIR] [-w N] [-p] [--overwrite]`

- If the argument ends with `/`, it is treated as a prefix download; otherwise the command auto-detects between an exact object and a prefix fallback.
- `-o, --output` sets the local destination directory (default `temp`).
- `-w, --workers` controls parallelism for prefix downloads (default `8`).
- `-p, --progress` shows a running progress counter.
- `--overwrite` allows replacing existing local files. Without it, the command aborts early if any target path already exists (fail-fast safety).

Examples:

- `s6 r2 download datasets/my_set/ -o temp -w 12 -p`
- `s6 r2 download datasets/my_set/data.jsonl -o temp -p`

### Upload

Command: `s6 r2 upload <source> <destination> [-w N] [-p] [--overwrite]`

- If `<source>` is a directory, uploads recursively under the destination prefix; if the destination does not end with `/`, one is appended.
- If `<source>` is a file and the destination ends with `/`, the file's basename is appended; otherwise it uploads to the provided key.
- `-w, --workers` controls parallelism for directory uploads (default `8`).
- `-p, --progress` enables progress output.
- Without `--overwrite`, upload is fail-fast if any destination keys already exist (preflight check via prefix listing).

Examples:

- `s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p`
- `s6 r2 upload ./temp/diverse_2/data.jsonl datasets/diverse_2/data.jsonl -p`

### Delete

Command: `s6 r2 delete <key-or-prefix> [-r|--recursive]`

- Deletes exactly one object unless `--recursive` is provided or the argument ends with `/`, in which case it deletes all objects under the prefix.

Examples:

- `s6 r2 delete datasets/diverse_2/data.jsonl`
- `s6 r2 delete datasets/diverse_2/ -r`

## CLI: `s6 dataset` Master Command

Location: `src/s6/app/dataset.py`

Global flags (apply to all subcommands):

- `--root`: local root for dataset directories (default `temp`)
- `--remote-prefix`: remote base prefix (default `datasets/`)
- `-w, --workers`: parallel workers for upload/download (default `8`)
- R2 flags: `-b/--bucket`, `-e/--endpoint`, `--region` (same defaults as above)

Subcommands:

### List

`s6 dataset list [--remote] [--local] [--local-only]`

- Lists dataset names (folders with `data.jsonl`).
- Local: scans `--root` (default `temp`).
- Remote: lists one-level prefixes under `--remote-prefix` (default `datasets/`).
- If no `--local/--remote` flags are passed, local is shown by default.

### Upload

`s6 dataset upload <name> [--overwrite]`

- Uploads the local dataset `--root/<name>` to the remote prefix `--remote-prefix/<name>/`.
- Uses parallel uploads with progress enabled by default.
- Without `--overwrite`, performs a fail-fast preflight: if any destination key already exists, the command aborts before uploading.
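The same operation can also be driven from Python via `R2Client.upload_directory` (documented in the R2Client API section below). A minimal sketch, assuming the default root `temp`, the default remote prefix `datasets/`, and an endpoint taken from `R2_ENDPOINT`:

```python
import os

from s6.utils.r2 import R2Client

name = "diverse_2"  # a dataset directory under ./temp that contains data.jsonl

client = R2Client(
    bucket_name=os.environ.get("R2_BUCKET", "assets"),
    endpoint_url=os.environ["R2_ENDPOINT"],
)

# Mirrors `s6 dataset upload diverse_2`: temp/diverse_2 -> datasets/diverse_2/
client.upload_directory(
    f"temp/{name}",
    f"datasets/{name}/",
    max_workers=8,    # parallel uploads, same default as the CLI
    progress=True,    # aggregate byte progress on stderr
    overwrite=False,  # fail fast if any destination key already exists
)
```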
### Download

`s6 dataset download <name> [--overwrite]`

- Downloads the remote dataset `--remote-prefix/<name>/` into the local directory `--root/<name>`.
- Uses parallel downloads with progress enabled by default.
- Without `--overwrite`, aborts early if any local target files already exist.
- Warns if the downloaded directory does not contain `data.jsonl`.

### Delete

`s6 dataset delete <name> [-y|--yes] [--local]`

- Deletes the remote dataset at `--remote-prefix/<name>/` (requires `--yes`).
- With `--local`, also removes the local dataset directory.

## R2Client API

Location: `src/s6/utils/r2.py`

### Design Overview

The storage client and the CLI are layered to keep responsibilities clear:

- R2Client (library)
  - Thin façade over a `boto3` S3 client bound to a specific bucket/endpoint.
  - Resolves credentials/region from arguments or the environment (`R2_*` or `AWS_*`).
  - Implements the minimal primitives we rely on: list, upload, download, delete.
  - Listings paginate and optionally use a delimiter for one‑level views.
  - Upload helpers refuse to overwrite by default; directory uploads allow an explicit `overwrite=True`.
  - Existence is checked via `HEAD` to avoid listing entire buckets.
  - Imports are guarded so documentation builds don’t require `boto3`.
- CLI utilities (`s6 r2` / `s6 dataset`)
  - Compose R2Client primitives into higher‑level operations with progress and concurrency controls.
  - Map local directories to object keys by preserving relative paths under a chosen prefix (`datasets/<name>/`).
  - Perform preflight safety checks to achieve fail‑fast, all‑or‑nothing behavior in collaborative environments.

See the module docs in `s6.utils.r2` for API‑level details and examples.

### Construction

```python
R2Client(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    bucket_name: str = "",
    endpoint_url: str = "",
    region_name: str = "auto",
    session: Optional[boto3.session.Session] = None,
)
```

- Credentials/region are read from the environment when arguments are `None`.
- `region_name="auto"` uses `R2_REGION_NAME`/`AWS_REGION`/`AWS_DEFAULT_REGION` when present.

### Listing

```python
list(prefix: str = "", recursive: bool = True) -> tuple[list[R2Object], list[str]]
```

- Returns `(objects, prefixes)`; `prefixes` is populated only when `recursive=False` (one-level list).

`R2Object` fields:

- `key: str`
- `size: int`
- `last_modified: Any`
- `etag: str`

### Download

```python
download_file(key: str, local_path: str, *, progress: bool = False) -> None

download_directory(
    prefix: str,
    local_dir: str,
    *,
    max_workers: int = 8,
    progress: bool = False,
    overwrite: bool = False,
) -> None
```

- Directory downloads are parallelized with a thread pool and show aggregate progress when `progress=True`.
- When `overwrite=False`, a preflight check ensures none of the target files already exist; otherwise it raises `FileExistsError` (fail-fast).

### Upload

```python
upload_file(local_path: str, key: str, *, progress: bool = False) -> None

upload_directory(
    local_dir: str,
    prefix: str,
    *,
    overwrite: bool = False,
    max_workers: int = 8,
    progress: bool = False,
) -> None

upload_bytes(data: bytes, key: str) -> None
```

- Directory uploads are parallelized; with `progress=True`, bytes across all files are aggregated and reported.
- When `overwrite=False`, a preflight uses a single remote listing to detect any destination collisions and aborts early with `FileExistsError` (see the sketch below).
- `upload_file` and `upload_bytes` never overwrite; delete first to replace.
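To make the fail-fast behavior concrete, here is a rough sketch of the preflight idea expressed with the `list` primitive above. It is illustrative only; the helper name and error message are made up and this is not the actual implementation:

```python
from s6.utils.r2 import R2Client


def preflight_upload(client: R2Client, relative_paths: list[str], prefix: str) -> None:
    """Illustrative preflight: abort before any byte is transferred if a destination key exists."""
    existing, _ = client.list(prefix)                        # one listing covers the whole prefix
    existing_keys = {obj.key for obj in existing}
    planned_keys = {prefix + rel for rel in relative_paths}  # keys preserve relative paths

    clashes = sorted(planned_keys & existing_keys)
    if clashes:
        raise FileExistsError(
            f"{len(clashes)} destination key(s) already exist (e.g. {clashes[0]}); "
            "use overwrite=True / --overwrite to replace them"
        )
```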
### Delete

```python
delete_object(key: str, missing_ok: bool = True) -> None
```

- Deletes an individual object. With `missing_ok=False`, raises if the object is not found.

### Concurrency and Progress

- Per-transfer concurrency is configured via `boto3.s3.transfer.TransferConfig` when available; otherwise defaults are used.
- Directory-level concurrency is controlled by `max_workers` on the client methods and CLI flags.
- Progress is printed to stderr at ~5 Hz in the form `Uploading: X/Y bytes (Z%)` or `Downloading: ...` and concludes with a newline.

## Safety Semantics

- Fail-fast no-overwrite: both `upload_directory` and `download_directory` perform preflight checks when `overwrite=False` and abort before starting any transfers if conflicts are found.
- Prefix/object detection: the download CLI treats a trailing `/` as a prefix; otherwise it tries the exact key first and falls back to a prefix if appropriate.

## Examples

### Multi‑User Workflow: Record → Upload → Replay

This workflow shows how one teammate records a dataset and another replays it remotely using the shared R2 storage.

On Machine A (Recorder):

```bash
# 1) Record from live input into a local dataset directory
s6 track -i network -o ./temp/run_net_01 -r -x

# 2) Upload the dataset to shared storage
s6 dataset upload run_net_01

# or equivalently, using r2
s6 r2 upload ./temp/run_net_01 datasets/run_net_01/ -w 12 -p
```

On Machine B (Consumer):

```bash
# 3) Download the dataset locally
s6 dataset download run_net_01

# or with r2
s6 r2 download datasets/run_net_01/ -o ./temp -w 12 -p

# 4) Replay the dataset for development and testing
s6 track -i ./temp/run_net_01 --repeat -x
```

Tips:

- Use `--config <path>` with `s6 track` to test different pipeline configurations against the same dataset.
- Keep `--repeat` on during development for quick iterative cycles.
- Profiling with `-x` writes Chrome trace logs; see `docs/recipes/pipeline_chrome_trace.md`.

### End-to-end roundtrip for a dataset named `diverse_2`

```bash
# Upload with defaults (parallel + progress)
s6 dataset upload diverse_2

# On another machine, download it
s6 dataset download diverse_2

# Inspect locally and remotely
s6 dataset list
s6 dataset list --remote

# Clean up remotely
s6 dataset delete diverse_2 --yes
```

Direct R2 usage:

```bash
s6 r2 list datasets/ --flat
s6 r2 upload ./temp/diverse_2 datasets/diverse_2/ -w 12 -p
s6 r2 download datasets/diverse_2/ -o temp -w 12 -p
s6 r2 delete datasets/diverse_2/ -r
```

## Troubleshooting

- Missing credentials: set `R2_ACCESS_KEY_ID`/`R2_SECRET_ACCESS_KEY` (or the `AWS_*` equivalents) and ensure your endpoint URL is correct for your R2 account.
- Partial results: operations should be all-or-nothing when `overwrite=False`. If you need to replace files, pass `--overwrite` to the download or upload commands. For single-file uploads, delete the existing object first and re-upload it (see the sketch below).
- Performance: increase `-w/--workers` for more parallelism. Very large datasets benefit from higher worker counts; monitor your network and rate limits.
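As a companion to the "delete first" note above, a minimal Python sketch of replacing a single object with the client primitives documented earlier (the endpoint is assumed to come from `R2_ENDPOINT`):

```python
import os

from s6.utils.r2 import R2Client

client = R2Client(
    bucket_name=os.environ.get("R2_BUCKET", "assets"),
    endpoint_url=os.environ["R2_ENDPOINT"],
)

key = "datasets/diverse_2/data.jsonl"

# upload_file never overwrites, so replacing a single object is delete-then-upload.
client.delete_object(key, missing_ok=True)  # no error if the key is already absent
client.upload_file("temp/diverse_2/data.jsonl", key, progress=True)
```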