pub trait ComputeLayerExecutor<F: Field> {
    type ExprEval: Sync;
    type DevMem: ComputeMemory<F>;
    type OpValue: Send;
    type KernelExec: KernelExecutor<F, ExprEval = Self::ExprEval>;

    // Required methods
    fn accumulate_kernels(
        &mut self,
        map: impl Sync + for<'a> Fn(&'a mut Self::KernelExec, usize, Vec<KernelBuffer<'a, F, <Self::KernelExec as KernelExecutor<F>>::Mem>>) -> Result<Vec<<Self::KernelExec as KernelExecutor<F>>::Value>, Error>,
        mem_maps: Vec<KernelMemMap<'_, F, Self::DevMem>>,
    ) -> Result<Vec<Self::OpValue>, Error>;
    fn map_kernels(
        &mut self,
        map: impl Sync + for<'a> Fn(&'a mut Self::KernelExec, usize, Vec<KernelBuffer<'a, F, <Self::KernelExec as KernelExecutor<F>>::Mem>>) -> Result<(), Error>,
        mem_maps: Vec<KernelMemMap<'_, F, Self::DevMem>>,
    ) -> Result<(), Error>;
    fn inner_product(
        &mut self,
        a_in: SubfieldSlice<'_, F, Self::DevMem>,
        b_in: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
    ) -> Result<Self::OpValue, Error>;
    fn tensor_expand(
        &mut self,
        log_n: usize,
        coordinates: &[F],
        data: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
    ) -> Result<(), Error>;
    fn fold_left(
        &mut self,
        mat: SubfieldSlice<'_, F, Self::DevMem>,
        vec: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
        out: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
    ) -> Result<(), Error>;
    fn fold_right(
        &mut self,
        mat: SubfieldSlice<'_, F, Self::DevMem>,
        vec: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
        out: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
    ) -> Result<(), Error>;
    fn fri_fold<FSub>(
        &mut self,
        ntt: &(impl AdditiveNTT<FSub> + Sync),
        log_len: usize,
        log_batch_size: usize,
        challenges: &[F],
        data_in: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
        data_out: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
    ) -> Result<(), Error>
    where
        FSub: BinaryField,
        F: ExtensionField<FSub>;
    fn extrapolate_line(
        &mut self,
        evals_0: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
        evals_1: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
        z: F,
    ) -> Result<(), Error>;
    fn compute_composite(
        &mut self,
        inputs: &SlicesBatch<<Self::DevMem as ComputeMemory<F>>::FSlice<'_>>,
        output: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
        composition: &Self::ExprEval,
    ) -> Result<(), Error>;
    fn pairwise_product_reduce(
        &mut self,
        input: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
        round_outputs: &mut [<Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>],
    ) -> Result<(), Error>;

    // Provided methods
    fn join<Out1: Send, Out2: Send>(
        &mut self,
        op1: impl Send + FnOnce(&mut Self) -> Result<Out1, Error>,
        op2: impl Send + FnOnce(&mut Self) -> Result<Out2, Error>,
    ) -> Result<(Out1, Out2), Error> { ... }
    fn map<Out: Send, I: ExactSizeIterator<Item: Send> + Send>(
        &mut self,
        iter: I,
        map: impl Sync + Fn(&mut Self, I::Item) -> Result<Out, Error>,
    ) -> Result<Vec<Out>, Error> { ... }
}
An interface for executing a sequence of operations on an accelerated compute device
This component defines a sequence of accelerated data transformations that must appear to execute in order on the selected compute device. Implementations may defer execution of any component of the defined sequence under the condition that the store-to-load ordering of the appended transformations is preserved.
The root ComputeLayerExecutor is obtained from ComputeLayer::execute. Nested instances for parallel and sequential blocks can be obtained via ComputeLayerExecutor::join and ComputeLayerExecutor::map respectively.
§Required Associated Types
type DevMem: ComputeMemory<F>
The device memory.
type KernelExec: KernelExecutor<F, ExprEval = Self::ExprEval>
The executor that can execute operations on a kernel-level granularity (i.e., a single core).
§Required Methods
fn accumulate_kernels(
    &mut self,
    map: impl Sync + for<'a> Fn(&'a mut Self::KernelExec, usize, Vec<KernelBuffer<'a, F, <Self::KernelExec as KernelExecutor<F>>::Mem>>) -> Result<Vec<<Self::KernelExec as KernelExecutor<F>>::Value>, Error>,
    mem_maps: Vec<KernelMemMap<'_, F, Self::DevMem>>,
) -> Result<Vec<Self::OpValue>, Error>
Launch many kernels in parallel and accumulate the scalar results with field addition.
This method provides low-level access to schedule parallel kernel executions on the compute
platform. A kernel is a program that executes synchronously in one thread, with access to
local memory buffers. When the environment launches a kernel, it sets up the kernel’s local
memory according to the memory mapping specifications provided by the mem_maps parameter.
The mapped buffers have type KernelBuffer, and they may be read-write or read-only.
When the kernel exits, it returns a small number of computed values as field elements. The
vector of returned scalars is accumulated via binary field addition across all kernels and
returned from the call.
This method is fairly general but also designed to fit the specific needs of the sumcheck protocol. That motivates the choice that returned values are small-length vectors that are accumulated with field addition.
§Buffer chunking
The kernel local memory buffers are thought of as slices of a larger buffer, which may or may not exist somewhere else. Each kernel operates on a chunk of the larger buffers. For example, the KernelMemMap::Chunked mapping specifies that each kernel operates on a read-only chunk of a buffer in global device memory. The KernelMemMap::Local mapping specifies that a kernel operates on a local scratchpad initialized with zeros and discarded at the end of kernel execution (sort of like /dev/null).
This ComputeLayer object can decide how many kernels to launch and thus how large each kernel’s buffer chunks are. The number of chunks must be a power of two. This information is provided to the kernel specification closure as an argument.
§Kernel specification
The kernel logic is constructed within a closure, which is the map parameter. The closure has three parameters:
- kernel_exec - the kernel execution environment.
- log_chunks - the binary logarithm of the number of kernels that are launched.
- buffers - a vector of kernel-local buffers.
The closure must respect certain assumptions:
- The kernel closure control flow is identical on each invocation when log_chunks is unchanged.
ComputeLayer implementations are free to call the specification closure multiple times, for example with different values for log_chunks.
§Arguments
- map - the kernel specification closure. See the “Kernel specification” section above.
- mem_maps - the memory mappings for the kernel-local buffers.
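The accumulate semantics above can be sketched sequentially. This is illustrative only, not the real API: the trait operates on device memory and accumulates with binary field addition, whereas this sketch uses plain slices and a toy prime field (P = 97) as a stand-in for F, and runs the "kernels" one after another.

```rust
// Illustrative only: a sequential model of accumulate_kernels' semantics.
// `P` is a toy prime field standing in for `F`; the real trait accumulates
// with binary field addition and may run kernels in parallel.
const P: u64 = 97;

fn fadd(a: u64, b: u64) -> u64 {
    (a + b) % P
}

fn accumulate_kernels_sketch(
    data: &[u64],
    log_chunks: usize,
    kernel: impl Fn(usize, &[u64]) -> Vec<u64>,
) -> Vec<u64> {
    // 2^log_chunks kernels, each seeing one chunk of the buffer.
    let chunk_len = data.len() >> log_chunks;
    let mut acc: Option<Vec<u64>> = None;
    for chunk in data.chunks(chunk_len) {
        let vals = kernel(log_chunks, chunk); // same closure for every kernel
        acc = Some(match acc {
            None => vals,
            Some(prev) => prev.iter().zip(&vals).map(|(&a, &b)| fadd(a, b)).collect(),
        });
    }
    acc.expect("at least one chunk")
}

fn main() {
    // Each kernel returns its chunk sum; accumulation across the 4 kernels
    // yields the sum of the whole buffer (mod P).
    let data: Vec<u64> = (1..=8).collect();
    let out = accumulate_kernels_sketch(&data, 2, |_log_chunks, chunk| {
        vec![chunk.iter().copied().fold(0, fadd)]
    });
    println!("{:?}", out); // sum of 1..=8 is 36
}
```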
fn map_kernels(
    &mut self,
    map: impl Sync + for<'a> Fn(&'a mut Self::KernelExec, usize, Vec<KernelBuffer<'a, F, <Self::KernelExec as KernelExecutor<F>>::Mem>>) -> Result<(), Error>,
    mem_maps: Vec<KernelMemMap<'_, F, Self::DevMem>>,
) -> Result<(), Error>
Launch many kernels in parallel to process buffers without accumulating results.
Similar to Self::accumulate_kernels, this method provides low-level access to schedule parallel kernel executions on the compute platform. The key difference is that this method is focused on performing parallel operations on buffers without a reduction phase.
Each kernel operates on its assigned chunk of data and writes its results directly to
the mutable buffers provided in the memory mappings.
This method is suitable for operations where you need to transform data in parallel without aggregating results, such as element-wise transformations of large arrays.
§Buffer chunking
The kernel local memory buffers follow the same chunking approach as Self::accumulate_kernels. Each kernel operates on a chunk of the larger buffers as specified by the memory mappings.
§Kernel specification
The kernel logic is constructed within a closure, which is the map parameter. The closure has three parameters:
- kernel_exec - the kernel execution environment.
- log_chunks - the binary logarithm of the number of kernels that are launched.
- buffers - a vector of kernel-local buffers.
Unlike Self::accumulate_kernels, this method does not expect the kernel to return any values for accumulation. Instead, the kernel should write its results directly to the mutable buffers provided in the buffers parameter.
The closure must respect certain assumptions:
- The kernel closure control flow is identical on each invocation when log_chunks is unchanged.
ComputeLayer implementations are free to call the specification closure multiple times, for example with different values for log_chunks.
§Arguments
- map - the kernel specification closure. See the “Kernel specification” section above.
- mem_maps - the memory mappings for the kernel-local buffers.
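The no-reduction variant can be sketched the same way. Again this is illustrative only: plain u64 chunks stand in for the device buffers, and the "kernels" run sequentially while the real method may run them in parallel.

```rust
// Illustrative only: a sequential model of map_kernels' semantics over a
// plain u64 buffer. Each "kernel" mutates its chunk in place; no values
// are returned or accumulated.
fn map_kernels_sketch(data: &mut [u64], log_chunks: usize, kernel: impl Fn(&mut [u64])) {
    let chunk_len = data.len() >> log_chunks; // 2^log_chunks kernels
    for chunk in data.chunks_mut(chunk_len) {
        kernel(chunk); // results are written directly into the chunk
    }
}

fn main() {
    let mut data = vec![1u64, 2, 3, 4];
    // Element-wise transformation: square each element (mod a toy prime 97).
    map_kernels_sketch(&mut data, 1, |chunk| {
        for x in chunk.iter_mut() {
            *x = *x * *x % 97;
        }
    });
    println!("{:?}", data); // [1, 4, 9, 16]
}
```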
fn inner_product(
    &mut self,
    a_in: SubfieldSlice<'_, F, Self::DevMem>,
    b_in: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
) -> Result<Self::OpValue, Error>
Returns the inner product of a vector of subfield elements with a vector of big field elements.
§Arguments
- a_in - the first input slice of subfield elements.
- b_in - the second input slice of F elements.
§Throws
- if the tower level of a_in is greater than F::TOWER_LEVEL
- unless a_in and b_in contain the same number of elements, and the number is a power of two
§Returns
Returns the inner product of a_in and b_in.
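The operation's shape can be sketched over plain slices. This is illustrative only: a toy prime field (P = 97) stands in for F, and both inputs are ordinary field elements here, whereas the real method mixes subfield and extension-field elements on device memory.

```rust
// Illustrative only: the shape of inner_product over a toy prime field
// (P = 97) standing in for F.
const P: u64 = 97;

fn inner_product_sketch(a_in: &[u64], b_in: &[u64]) -> Result<u64, String> {
    // Mirror the documented Throws conditions.
    if a_in.len() != b_in.len() || !a_in.len().is_power_of_two() {
        return Err("inputs must have equal power-of-two lengths".into());
    }
    Ok(a_in
        .iter()
        .zip(b_in)
        .fold(0, |acc, (&x, &y)| (acc + x * y % P) % P))
}

fn main() {
    // 1*5 + 2*6 + 3*7 + 4*8 = 70
    println!("{:?}", inner_product_sketch(&[1, 2, 3, 4], &[5, 6, 7, 8]));
}
```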
fn tensor_expand(
    &mut self,
    log_n: usize,
    coordinates: &[F],
    data: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
) -> Result<(), Error>
Computes the iterative tensor product of the input with the given coordinates.
This operation modifies the data buffer in place.
§Mathematical Definition
This operation accepts parameters
- $n \in \mathbb{N}$ (log_n),
- $k \in \mathbb{N}$ (coordinates.len()),
- $v \in L^{2^n}$ (data[..1 << log_n]),
- $r \in L^k$ (coordinates),
and computes the vector
$$ v \otimes (1 - r_0, r_0) \otimes \ldots \otimes (1 - r_{k-1}, r_{k-1}) $$
§Throws
- unless 2**(log_n + coordinates.len()) equals data.len()
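The in-place expansion can be sketched directly from the definition. This is illustrative only, using a toy prime field (P = 97) in place of F: the valid prefix of the buffer doubles with each coordinate, the low half scaled by (1 - r) and the high half by r.

```rust
// Illustrative only: tensor_expand over a toy prime field (P = 97).
const P: u64 = 97;

fn fmul(a: u64, b: u64) -> u64 { a * b % P }
fn fsub(a: u64, b: u64) -> u64 { (a + P - b) % P }

fn tensor_expand_sketch(log_n: usize, coordinates: &[u64], data: &mut [u64]) {
    assert_eq!(data.len(), 1 << (log_n + coordinates.len()));
    let mut len = 1 << log_n; // prefix data[..len] currently holds the result
    for &r in coordinates {
        for i in 0..len {
            let v = data[i];
            data[len + i] = fmul(v, r);    // high half: v_i * r
            data[i] = fmul(v, fsub(1, r)); // low half: v_i * (1 - r)
        }
        len *= 2;
    }
}

fn main() {
    // v = (1), r = (2, 3): result is ((1-2)(1-3), 2(1-3), (1-2)*3, 2*3)
    //                              = (2, -4, -3, 6) mod 97 = (2, 93, 94, 6)
    let mut data = vec![1u64, 0, 0, 0];
    tensor_expand_sketch(0, &[2, 3], &mut data);
    println!("{:?}", data);
}
```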
fn fold_left(
    &mut self,
    mat: SubfieldSlice<'_, F, Self::DevMem>,
    vec: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
    out: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
) -> Result<(), Error>
Computes left matrix-vector multiplication of a subfield matrix with a big field vector.
§Mathematical Definition
This operation accepts
- $n \in \mathbb{N}$ (out.len()),
- $m \in \mathbb{N}$ (vec.len()),
- $M \in K^{n \times m}$ (mat),
- $v \in K^m$ (vec),
and computes the vector $Mv$.
§Args
- mat - a slice of elements from a subfield of F.
- vec - a slice of F elements.
- out - a buffer for the output vector of F elements.
§Throws
- Returns an error if mat.len() does not equal vec.len() * out.len().
- Returns an error if the elements of mat do not belong to a subfield of F.
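The matrix-vector product $Mv$ can be sketched over plain slices. This is illustrative only: a toy prime field (P = 97) stands in for both fields, and a row-major layout for mat is an assumption of this sketch, not something the trait documents.

```rust
// Illustrative only: fold_left as a plain matrix-vector product over a toy
// prime field (P = 97), assuming a row-major layout for `mat`.
const P: u64 = 97;

fn fold_left_sketch(mat: &[u64], vec: &[u64], out: &mut [u64]) {
    assert_eq!(mat.len(), vec.len() * out.len());
    let m = vec.len();
    for (i, o) in out.iter_mut().enumerate() {
        // out[i] = row i of M dotted with v
        *o = mat[i * m..(i + 1) * m]
            .iter()
            .zip(vec)
            .fold(0, |acc, (&a, &b)| (acc + a * b % P) % P);
    }
}

fn main() {
    // M = [[1, 2], [3, 4]], v = (1, 1) => Mv = (3, 7)
    let mut out = vec![0u64; 2];
    fold_left_sketch(&[1, 2, 3, 4], &[1, 1], &mut out);
    println!("{:?}", out);
}
```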
fn fold_right(
    &mut self,
    mat: SubfieldSlice<'_, F, Self::DevMem>,
    vec: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
    out: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
) -> Result<(), Error>
Computes right matrix-vector multiplication of a subfield matrix with a big field vector.
§Mathematical Definition
This operation accepts
- $n \in \mathbb{N}$ (vec.len()),
- $m \in \mathbb{N}$ (out.len()),
- $M \in K^{n \times m}$ (mat),
- $v \in K^n$ (vec),
and computes the vector $((v')M)'$, where the prime denotes a transpose.
§Args
- mat - a slice of elements from a subfield of F.
- vec - a slice of F elements.
- out - a buffer for the output vector of F elements.
§Throws
- Returns an error if mat.len() does not equal vec.len() * out.len().
- Returns an error if the elements of mat do not belong to a subfield of F.
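The transposed product $v'M$ can be sketched the same way. Illustrative only: same toy prime field (P = 97) and the same row-major layout assumption as the fold_left sketch.

```rust
// Illustrative only: fold_right as v'M over a toy prime field (P = 97),
// assuming a row-major layout for `mat`.
const P: u64 = 97;

fn fold_right_sketch(mat: &[u64], vec: &[u64], out: &mut [u64]) {
    assert_eq!(mat.len(), vec.len() * out.len());
    let m = out.len();
    out.fill(0);
    for (i, &v) in vec.iter().enumerate() {
        for j in 0..m {
            // out[j] = sum_i v[i] * M[i][j]
            out[j] = (out[j] + v * mat[i * m + j] % P) % P;
        }
    }
}

fn main() {
    // M = [[1, 2], [3, 4]], v = (1, 1) => v'M = (4, 6)
    let mut out = vec![0u64; 2];
    fold_right_sketch(&[1, 2, 3, 4], &[1, 1], &mut out);
    println!("{:?}", out);
}
```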
fn fri_fold<FSub>(
    &mut self,
    ntt: &(impl AdditiveNTT<FSub> + Sync),
    log_len: usize,
    log_batch_size: usize,
    challenges: &[F],
    data_in: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
    data_out: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
) -> Result<(), Error>
where
    FSub: BinaryField,
    F: ExtensionField<FSub>
FRI-fold the interleaved codeword using the given challenges.
The FRI-fold operation folds a length $2^{n+b+\eta}$ vector of field elements into a length $2^n$ vector of field elements. $n$ is the log block length of the code, $b$ is the log batch size, and $b + \eta$ is the number of challenge elements. The operation has the following mathematical structure:
1. Split the challenge vector into two parts: $c_0$ with length $b$ and $c_1$ with length $\eta$.
2. Low fold the input data with the tensor expansion of $c_0$.
3. Apply $\eta$ layers of the inverse additive NTT to the data.
4. Low fold the input data with the tensor expansion of $c_1$.
The algorithm to perform steps 3 and 4 can be combined into a linear amount of work, whereas step 3 on its own would require $\eta$ independent linear passes.
See DP24, Section 4.2 for more details.
This operation writes the result out-of-place into an output buffer.
§Arguments
- ntt - the NTT instance, used to look up the twiddle values.
- log_len - $n + \eta$, the binary logarithm of the code length.
- log_batch_size - $b$, the binary logarithm of the interleaved code batch size.
- challenges - the folding challenges, with length $b + \eta$.
- data_in - an input vector, with length $2^{n + b + \eta}$.
- data_out - an output buffer, with length $2^n$.
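The batch-folding step in isolation can be sketched as follows. This is illustrative only and covers just the low fold against the tensor expansion of $c_0$: a toy prime field (P = 97) stands in for F, adjacent elements are assumed to form the interleaved batch, and the inverse NTT layers and the final fold are omitted entirely.

```rust
// Illustrative only: the batch-folding step of FRI-fold, over a toy prime
// field (P = 97). Each group of 2^b adjacent elements is folded against
// the tensor expansion of the first b challenges.
const P: u64 = 97;

fn low_fold_sketch(data_in: &[u64], tensor: &[u64]) -> Vec<u64> {
    // `tensor` is the tensor expansion of c_0, with length 2^b.
    assert_eq!(data_in.len() % tensor.len(), 0);
    data_in
        .chunks(tensor.len())
        .map(|group| {
            group
                .iter()
                .zip(tensor)
                .fold(0, |acc, (&x, &t)| (acc + x * t % P) % P)
        })
        .collect()
}

fn main() {
    // b = 1: fold pairs (x0, x1) into (1 - c)*x0 + c*x1 with challenge c = 3.
    let c = 3u64;
    let tensor = [(1 + P - c) % P, c]; // tensor expansion of one challenge
    println!("{:?}", low_fold_sketch(&[1, 2, 3, 4], &tensor));
}
```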
fn extrapolate_line(
    &mut self,
    evals_0: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
    evals_1: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
    z: F,
) -> Result<(), Error>
Extrapolates a line between a vector of evaluations at 0 and evaluations at 1.
Given two values $y_0, y_1$, this operation computes the value $y_z = y_0 + (y_1 - y_0) z$, which is the value of the line that interpolates $(0, y_0), (1, y_1)$ at $z$. It performs this operation in parallel over two vectors of big field elements of equal size.
The operation writes the result back in-place into the evals_0 buffer.
§Args
- evals_0 - this is both an input and output buffer. As an input, it is populated with the values $y_0$, which are the line’s values at 0.
- evals_1 - an input buffer with the values $y_1$, which are the line’s values at 1.
- z - the scalar evaluation point.
§Throws
- if evals_0 and evals_1 are not the same size.
- if the sizes of evals_0 and evals_1 are not powers of two.
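The formula above can be sketched over plain slices. Illustrative only, with a toy prime field (P = 97) standing in for F.

```rust
// Illustrative only: extrapolate_line over a toy prime field (P = 97),
// writing y_z = y_0 + (y_1 - y_0) * z back into evals_0 in place.
const P: u64 = 97;

fn extrapolate_line_sketch(evals_0: &mut [u64], evals_1: &[u64], z: u64) {
    assert_eq!(evals_0.len(), evals_1.len());
    for (y0, &y1) in evals_0.iter_mut().zip(evals_1) {
        *y0 = (*y0 + (y1 + P - *y0) % P * z) % P;
    }
}

fn main() {
    // y0 = 1, y1 = 5, z = 2: the line through (0,1) and (1,5) at z=2 is 9.
    let mut evals_0 = vec![1u64];
    extrapolate_line_sketch(&mut evals_0, &[5], 2);
    println!("{:?}", evals_0);
}
```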
fn compute_composite(
    &mut self,
    inputs: &SlicesBatch<<Self::DevMem as ComputeMemory<F>>::FSlice<'_>>,
    output: &mut <Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>,
    composition: &Self::ExprEval,
) -> Result<(), Error>
Computes the elementwise application of a compiled arithmetic expression to multiple input slices.
This operation applies the composition expression to each row of input values, where a row consists of one element from each input slice at the same index position. The results are stored in the output slice.
§Mathematical Definition
Given:
- Multiple input slices $P_0, \ldots, P_{m-1}$, each of length $2^n$ elements
- A composition function $C(X_0, \ldots, X_{m-1})$
- An output slice $P_{\text{out}}$ of length $2^n$ elements
This operation computes:
$$ P_{\text{out}}[i] = C(P_0[i], \ldots, P_{m-1}[i]) \quad \forall i \in \{0, \ldots, 2^n - 1\} $$
§Arguments
- inputs - A slice of input slices, where each slice contains field elements.
- output - A mutable output slice where the results will be stored.
- composition - The compiled arithmetic expression to apply.
§Throws
- Returns an error if any input or output slice has a length that is not a power of two.
- Returns an error if the input and output slices do not all have the same length.
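The row-wise evaluation can be sketched over plain slices. Illustrative only: the compiled expression is modeled as an ordinary closure over a row of u64 values mod a toy prime 97, rather than an ExprEval.

```rust
// Illustrative only: elementwise application of a composition to rows drawn
// from multiple input columns, over plain u64 values.
fn compute_composite_sketch(
    inputs: &[Vec<u64>],
    output: &mut [u64],
    composition: impl Fn(&[u64]) -> u64,
) {
    let mut row = vec![0u64; inputs.len()];
    for i in 0..output.len() {
        for (r, column) in row.iter_mut().zip(inputs) {
            *r = column[i]; // gather P_0[i], ..., P_{m-1}[i]
        }
        output[i] = composition(&row); // P_out[i] = C(row)
    }
}

fn main() {
    // C(X0, X1) = X0 * X1 + X0, applied pointwise to two columns of length 2.
    let inputs = vec![vec![1u64, 2], vec![3u64, 4]];
    let mut output = vec![0u64; 2];
    compute_composite_sketch(&inputs, &mut output, |row| {
        (row[0] * row[1] + row[0]) % 97
    });
    println!("{:?}", output); // [1*3+1, 2*4+2] = [4, 10]
}
```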
fn pairwise_product_reduce(
    &mut self,
    input: <Self::DevMem as ComputeMemory<F>>::FSlice<'_>,
    round_outputs: &mut [<Self::DevMem as ComputeMemory<F>>::FSliceMut<'_>],
) -> Result<(), Error>
Reduces a slice of elements to a single value by recursively applying pairwise multiplication.
Given an input slice x of length n = 2^k for some integer k, this function computes the result:
$$ y = \prod_{i=0}^{n-1} x_i $$
However, instead of a flat left-to-right reduction, the computation proceeds in k = log₂(n) rounds, halving the number of elements each time:
- Round 0: $$ x_{i,0} = x_{2i} \cdot x_{2i+1} \quad \text{for } i = 0 \ldots \frac{n}{2} - 1 $$
- Round 1: $$ x_{i,1} = x_{2i,0} \cdot x_{2i+1,0} $$
- …
- Final round: $$ y = x_{0,k-1} = \prod_{i=0}^{n-1} x_i $$
This binary tree-style reduction is mathematically equivalent to the full product, but structured for efficient parallelization.
§Arguments
- input - A slice of input field elements provided to the first reduction round.
- round_outputs - A mutable slice of preallocated output field elements for each reduction round. round_outputs.len() must equal log₂(input.len()). The length of the FSlice at index i must equal input.len() / 2**(i + 1) for i in 0..round_outputs.len().
§Throws
- Returns an error if the length of input is not a power of 2.
- Returns an error if the length of input is less than 2 (no reductions are possible).
- Returns an error if round_outputs.len() != log₂(input.len())
- Returns an error if any element in round_outputs does not satisfy round_outputs[i].len() == input.len() / 2**(i + 1) for i in 0..round_outputs.len()
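The tree reduction can be sketched directly. Illustrative only: plain u64 values mod a toy prime 97 stand in for F, and the rounds' outputs are returned as freshly allocated vectors rather than written into preallocated device buffers.

```rust
// Illustrative only: the binary-tree product reduction, returning every
// round's output.
const P: u64 = 97;

fn pairwise_product_reduce_sketch(input: &[u64]) -> Vec<Vec<u64>> {
    assert!(input.len().is_power_of_two() && input.len() >= 2);
    let mut rounds = Vec::new();
    let mut cur = input.to_vec();
    while cur.len() > 1 {
        // Round i halves the element count: x_{j,i} = x_{2j} * x_{2j+1}
        let next: Vec<u64> = cur.chunks(2).map(|pair| pair[0] * pair[1] % P).collect();
        rounds.push(next.clone());
        cur = next;
    }
    rounds // rounds[i].len() == input.len() / 2^(i + 1)
}

fn main() {
    // 1*2*3*4 = 24, computed as rounds [[2, 12], [24]].
    println!("{:?}", pairwise_product_reduce_sketch(&[1, 2, 3, 4]));
}
```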
§Provided Methods
§Dyn Compatibility
This trait is not dyn compatible.
In older versions of Rust, dyn compatibility was called "object safety", so this trait is not object safe.