s6.utils.datapipe

Torch-compatible Dataset around StructuredDataset.

Provides datakey-based selection of nested fields and optional balancing across one or more StructuredDataset directories.

Example

Suppose you have a StructuredDataset stored in "./temp" with samples like:

{"image": <np.ndarray>, "label": 0}
{"image": <np.ndarray>, "label": 1}

You can wrap it as a PyTorch Dataset:

from s6.utils.datapipe import StructuredDatasetTorch
from torch.utils.data import DataLoader

dataset = StructuredDatasetTorch(
    "./temp",                  # path to one or more base dirs
    datakeys=["image", "label"],
    balance=False,              # no balancing across dirs
    shuffle=True,               # shuffle entries
    seed=42,                    # reproducible
)
loader = DataLoader(dataset, batch_size=8, shuffle=False)
for images, labels in loader:
    # images: torch.Tensor of shape [8, ...]
    # labels: torch.Tensor of shape [8]
    ...

class s6.utils.datapipe.StructuredDatasetTorch(dataset_dirs: str | List[str], datakeys: List[str], balance: bool = True, shuffle: bool = True, seed: int | None = None)

Bases: Dataset

A torch.utils.data.Dataset wrapper over one or more StructuredDataset directories.

It extracts specific nested fields (via datakey strings) and returns them as torch.Tensor when possible, falling back to Python values otherwise.

Supports balancing across multiple StructuredDataset sources and initial shuffling.

Nested fields are specified with dot-list syntax, e.g. 'a.b[0].c' → sample['a']['b'][0]['c'], with getattr used instead of item access at any level that is a model object rather than a dict.
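The dot-list lookup can be illustrated with a small standalone resolver. This is only a sketch of the idea, not the library's implementation: the helper name resolve_datakey and the Inner class are hypothetical, and the real parser may accept syntax this one does not.

```python
import re
from dataclasses import dataclass
from typing import Any

# Matches a name segment (e.g. "a") or a bracketed integer index (e.g. "[0]").
_TOKEN = re.compile(r"([^.\[\]]+)|\[(-?\d+)\]")


def resolve_datakey(sample: Any, datakey: str) -> Any:
    """Hypothetical helper: walk `sample` along a dot-list datakey."""
    value = sample
    for match in _TOKEN.finditer(datakey):
        name, index = match.group(1), match.group(2)
        if index is not None:
            # "[0]" style segment: positional access into a sequence.
            value = value[int(index)]
        elif isinstance(value, dict):
            # Name segment on a dict: item access.
            value = value[name]
        else:
            # Name segment on any other object: attribute access.
            value = getattr(value, name)
    return value


@dataclass
class Inner:
    """Stand-in for a model object (assumption, for illustration only)."""
    c: int


sample = {"a": {"b": [Inner(c=7)]}, "label": 1}
nested = resolve_datakey(sample, "a.b[0].c")
flat = resolve_datakey(sample, "label")
```

Here the first three segments use item access on dicts and a list, and the final segment falls back to getattr because Inner is not a dict.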

Parameters:
  • dataset_dirs (str or List[str]) – Path or list of paths to one or more StructuredDataset directories.

  • datakeys (List[str]) – List of datakey strings indicating which fields to extract per sample.

  • balance (bool, default=True) – If True and multiple dirs are given, undersample larger datasets to match the smallest one. If False, include all samples sequentially.

  • shuffle (bool, default=True) – If True, shuffle the index mapping once upon initialization.

  • seed (Optional[int], default=None) – Optional random seed for reproducible shuffling.
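The interaction of balance, shuffle, and seed can be sketched as the construction of a flat index mapping. The function build_index below is hypothetical and works on dataset sizes only; in particular, choosing the retained samples at random when undersampling is an assumption, since the description above does not say which samples are dropped.

```python
import random
from typing import List, Optional, Tuple


def build_index(
    sizes: List[int],
    balance: bool = True,
    shuffle: bool = True,
    seed: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Hypothetical sketch: list of (dataset_id, local_index) pairs."""
    rng = random.Random(seed)
    # Balancing only applies when more than one dataset is given.
    keep = min(sizes) if (balance and len(sizes) > 1) else None
    index: List[Tuple[int, int]] = []
    for ds_id, size in enumerate(sizes):
        local = list(range(size))
        if keep is not None and size > keep:
            # Undersample (assumed random) down to the smallest dataset's size.
            local = sorted(rng.sample(local, keep))
        index.extend((ds_id, i) for i in local)
    if shuffle:
        # Shuffle the mapping once; a fixed seed makes this reproducible.
        rng.shuffle(index)
    return index


balanced = build_index([5, 3], balance=True, shuffle=False, seed=0)
sequential = build_index([5, 3], balance=False, shuffle=False)
```

With sizes 5 and 3, the balanced mapping holds 3 entries per dataset, while the unbalanced one lists all 8 samples in order.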

Return type:

Each item is a single torch.Tensor if only one datakey is provided, otherwise a tuple of tensors/values, one per datakey.

Example

from s6.utils.datapipe import StructuredDatasetTorch

# Wrap a single StructuredDataset directory and extract 'image' & 'label'
ds = StructuredDatasetTorch(
    './temp',
    datakeys=['image', 'label'],
    balance=False,
    shuffle=True,
    seed=0,
)
# Now ds[i] → (image_tensor, label_tensor)
img, lbl = ds[0]
print(img.shape, lbl.item())

property line_data: List[dict]

List of raw JSON records (unrestored) for all underlying StructuredDatasets.