Batch Sealing with SupraSeal

This page explains how to set up the SupraSeal batch sealer in Curio.

SupraSeal is an optimized batch sealing implementation for Filecoin that allows sealing multiple sectors in parallel. It can significantly improve sealing throughput compared to sealing sectors individually.

Key Features

  • Seals multiple sectors (up to 128) in a single batch

    • Up to 16x better core utilisation efficiency

  • Optimized to utilize CPU and GPU resources efficiently

  • Uses raw NVMe devices for layer storage instead of RAM

Requirements

  • CPU with at least 4 cores per CCX (AMD) or equivalent

  • NVMe drives with high IOPS (10-20M total IOPS recommended)

  • GPU for PC2 phase (NVIDIA RTX 3090 or better recommended)

  • 1GB hugepages configured (minimum 36 pages)

  • Ubuntu 22.04 or compatible Linux distribution (gcc-11 required, doesn't need to be system-wide)

  • At least 256GB RAM, ALL MEMORY CHANNELS POPULATED

    • Without all memory channels populated, sealing performance will suffer drastically

  • NUMA-Per-Socket (NPS) set to 1

Storage Recommendations

You need 2 sets of NVMe drives:

  1. Drives for layers:

    • Total 10-20M IOPS

    • Capacity for 11 x 32G x batchSize x pipelines

    • Raw unformatted block devices (SPDK will take them over)

    • Each drive should be able to sustain ~2GiB/s of writes

      • This requirement isn't well understood yet; it's possible that lower write rates are fine. More testing is needed.

  2. Drives for P2 output:

    • With a filesystem

    • Fast with sufficient capacity (~70G x batchSize x pipelines)

    • Can be remote storage if fast enough (~500MiB/s/GPU)
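
For example, applying these formulas to a 128-sector batch running two pipelines: the SPDK layer drives need roughly 11 x 32GiB x 128 x 2 ≈ 88TiB of raw capacity, and the P2 output drives need roughly 70GiB x 128 x 2 ≈ 17.5TiB.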

The following table shows the number of NVMe drives required for different batch sizes. The drive count column indicates N + M, where N is the number of drives for layer data (SPDK) and M is the number of drives for P2 output (filesystem). The iops/drive row shows the minimum IOPS per drive required for the batch size. A batch size marked 2x means a dual-pipeline drive setup. IOPS requirements are calculated simply by dividing the 10M total target IOPS by the number of drives; in reality, depending on CPU core speed, this may be too low or higher than necessary. When ordering a system with barely enough IOPS, plan to have free drive slots in case you need to add more drives later.

| Batch Size | 3.84TB | 7.68TB | 12.8TB | 15.36TB | 30.72TB |
| --- | --- | --- | --- | --- | --- |
| 32 | 4 + 1 | 2 + 1 | 1 + 1 | 1 + 1 | 1 + 1 |
| ^ iops/drive | 2500K | 5000K | 10000K | 10000K | 10000K |
| 64 (2x 32) | 7 + 2 | 4 + 1 | 2 + 1 | 2 + 1 | 1 + 1 |
| ^ iops/drive | 1429K | 2500K | 5000K | 5000K | 10000K |
| 128 (2x 64) | 13 + 3 | 7 + 2 | 4 + 1 | 4 + 1 | 2 + 1 |
| ^ iops/drive | 770K | 1429K | 2500K | 2500K | 5000K |
| 2x 128 | 26 + 6 | 13 + 3 | 8 + 2 | 7 + 2 | 4 + 1 |
| ^ iops/drive | 385K | 770K | 1250K | 1429K | 2500K |

Hardware Recommendations

Currently, the community is trying to determine the best hardware configurations for batch sealing. Some general observations are:

  • Single socket systems will be easier to use at full capacity

  • You want a lot of NVMe slots; on PCIe Gen4 platforms with large batch sizes you may use 20-24 3.84TB NVMe drives

  • In general you'll want to make sure all memory channels are populated

  • You need 4-8 physical cores (not threads) for batch-wide compute; in addition, on each CCX you'll lose 1 core to a "coordinator"

    • Each thread computes 2 sectors

    • On Zen 2 and earlier, hashers compute only one sector per thread

    • Large (many-core) CCX-es are typically better
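
For example, on a Zen 3 part with 8-core CCXes, each CCX contributes 1 coordinator core plus 7 hasher cores, i.e. 14 hasher threads hashing 28 sectors concurrently.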

Please consider contributing to the SupraSeal hardware examples.

Setup

Check NUMA setup:
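
A quick way to check is with numactl (from the numactl package):

```
numactl --hardware
```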

You should expect to see available: 1 nodes (0). If you see more than one node you need to go into your UEFI and set NUMA Per Socket (or a similar setting) to 1.

Configure hugepages:

This can be done by adding the following to /etc/default/grub. You need 36 1G hugepages for the batch sealer.
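
A minimal sketch of the relevant grub setting; append the hugepage parameters to your existing GRUB_CMDLINE_LINUX_DEFAULT rather than replacing other options you may already have:

```
GRUB_CMDLINE_LINUX_DEFAULT="hugepagesz=1G hugepages=36 default_hugepagesz=1G"
```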

Then run sudo update-grub and reboot the machine.

Or at runtime:
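
For 1G pages this can be done through sysfs, for example:

```
echo 36 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
```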

Then check /proc/meminfo to verify the hugepages are available:
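
For example:

```
grep -i huge /proc/meminfo
```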

Expect output like:
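
Something like the following (other hugepage-related lines may also appear):

```
HugePages_Total:      36
HugePages_Free:       36
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
```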

Check that HugePages_Free is equal to 36; the kernel can sometimes use some of the hugepages for other purposes.

Dependencies

CUDA 12.x is required, 11.x won't work. The build process depends on GCC 11.x system-wide or gcc-11/g++-11 installed locally.

  • On Arch install https://aur.archlinux.org/packages/gcc11

  • Ubuntu 22.04 has GCC 11.x by default

  • On newer Ubuntu install gcc-11 and g++-11 packages

  • In addition to the general build dependencies (listed on the installation page), you need libgmp-dev and libconfig++-dev
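
On Ubuntu, for example:

```
sudo apt install gcc-11 g++-11 libgmp-dev libconfig++-dev
```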

Building

Build and install the batch-capable Curio binary:
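
A sketch of the build step, assuming the batch-enabled make target in the Curio repository (check the repository's Makefile for the exact target name):

```
# in the curio source tree
make batch
sudo make install
```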

For calibnet
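
Assuming a calibnet variant of the batch target exists, as with Curio's other build targets:

```
make batch-calibnet
sudo make install
```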

Set up NVMe devices for SPDK:
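
A sketch assuming SPDK is vendored under the supraseal dependency tree built in the previous step (the path below is an assumption; use your SPDK checkout). SPDK's setup.sh binds the NVMe devices to user-space drivers and reserves hugepages via the NRHUGE variable:

```
cd extern/supraseal/deps/spdk-v24.05   # path is an assumption
sudo env NRHUGE=36 ./scripts/setup.sh
```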

This is only needed while batch sealing is in beta, future versions of Curio will handle this automatically.

Benchmark NVMe IOPS

Please make sure to benchmark the raw NVMe IOPS before proceeding with further configuration, to verify that the IOPS requirements are fulfilled.
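
One way to do this is with SPDK's bundled perf example, run after setup.sh has bound the devices; this runs a 4KiB random-read workload for 10 seconds against all SPDK-bound NVMe devices (the path is an assumption; use your SPDK checkout):

```
cd extern/supraseal/deps/spdk-v24.05   # path is an assumption
sudo ./build/examples/perf -q 64 -o 4096 -w randread -t 10
```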

The output reports per-device IOPS; ideally the total across all devices should exceed 10M IOPS.

PC2 output storage

Attach scratch space storage for PC2 output. The batch sealer needs ~70GiB per sector in a batch: 32GiB for the sealed sector and ~36GiB for the cache directory with TreeC/TreeR and aux files.
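
For example, with a filesystem mounted at a hypothetical /fast-storage/batch-p2 path, something like the following should work (verify the exact subcommand and flags with curio cli storage --help on your version):

```
curio cli storage attach --init --seal /fast-storage/batch-p2
```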

Usage

  1. Start the Curio node with the batch sealer layer (see the command sketches after this list)

  2. Add a batch of CC sectors (see the command sketches after this list)

  3. Monitor progress - you should see a "Batch..." task running in the Curio GUI

  4. PC1 will take 3.5-5 hours, followed by PC2 on the GPU

  5. After batch completion, the storage will be released for the next batch
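
Minimal command sketches for steps 1 and 2, assuming the layer created in the Configuration section below is named batch-machine1 and an example actor f01234 (verify the flag names with curio run --help and curio seal start --help):

```
# start the node with the batch sealer layer
curio run --layers batch-machine1

# queue a batch of 32 CC sectors for the given actor
curio seal start --now --cc --count 32 --actor f01234
```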

Configuration

  • Run curio calc batch-cpu on the target machine to determine supported batch sizes for your CPU

  • Create a new layer configuration for the batch sealer, e.g. batch-machine1:
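
A sketch of what such a layer might contain, in Curio's TOML configuration format (created e.g. via the Curio web UI or the curio config subcommands). The field names below are assumptions and should be checked against the batch sealing options shown in the configuration reference for your version:

```
[Subsystems]
  # enable the SupraSeal batch sealer on machines using this layer (assumed field name)
  EnableBatchSeal = true

[Seal]
  # placeholders; size these from `curio calc batch-cpu` output (assumed field names)
  #BatchSealBatchSize = 32
  #BatchSealPipelines = 2
```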

Optimization

  • Balance batch size, CPU cores, and NVMe drives to keep PC1 running constantly

  • Ensure sufficient GPU capacity to complete PC2 before next PC1 batch finishes

  • Monitor CPU, GPU and NVMe utilization to identify bottlenecks

  • Monitor hasher core utilisation

Troubleshooting

Node doesn't start / isn't visible in the UI

  • Ensure hugepages are configured correctly

  • Check NVMe device IOPS and capacity

    • If spdk setup fails, try running wipefs -a on the NVMe devices (this will wipe partitions from the devices, be careful!)

Performance issues

You can monitor performance by looking at "hasher" core utilisation in e.g. htop.

To identify hasher cores, call curio calc supraseal-config --batch-size 128 (with the correct batch size) and look for coordinators.
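
The output includes a list of coordinator cores and their hasher thread counts. An illustrative fragment consistent with the example discussed below (the exact output format may differ; remaining entries elided):

```
coordinators = (
  { core = 59; hashers = 8; },
  { core = 64; hashers = 14; },
  ...
)
```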

In this example, cores 59, 64, 72, 80, and 88 are "coordinators", with two hasher threads per hasher core, meaning that

  • In the first group, core 59 is the coordinator and cores 60-63 are hashers (4 hasher cores / 8 hasher threads)

  • In the second group, core 64 is the coordinator and cores 65-71 are hashers (7 hasher cores / 14 hasher threads)

  • And so on

Coordinator cores will usually sit at 100% utilisation. Hasher threads SHOULD also sit at 100% utilisation; anything less indicates a bottleneck in the system, such as insufficient NVMe IOPS, insufficient memory bandwidth, or an incorrect NUMA setup.

To troubleshoot:

  • Read the requirements at the top of this page very carefully

  • Validate GPU setup if PC2 is slow

  • Review logs for any errors during batch processing

Slower than expected NVMe speed

If the NVMe benchmark shows lower than expected IOPS, you can try formatting the NVMe devices with SPDK:
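
One option is SPDK's interactive nvme_manage example, which can format namespaces (for instance to a 4096-byte LBA size). The path below is an assumption based on the earlier build steps:

```
cd extern/supraseal/deps/spdk-v24.05   # path is an assumption
sudo ./build/examples/nvme_manage
```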

Follow the tool's interactive menus to format the device, then re-run the IOPS benchmark to check whether performance has improved.
