Standard API 

Returns

The initialized state of the estimator.

sync(state, pmap_axis_name)[source]

Synchronizes across devices the state of the estimator.

Return type: BlockDiagonalCurvature.State

multiply_matpower(state, parameter_structured_vector, identity_weight, power, exact_power, use_cached, pmap_axis_name, norm_to_scale_identity_weight_per_block=None)[source]

Computes (CurvatureMatrix + identity_weight I)**power times vector.

Parameters

state (BlockDiagonalCurvature.State) – The state of the estimator.
parameter_structured_vector (utils.Params) – A vector in the same structure as the parameters of the model.
identity_weight (Union[Numeric, Sequence[Numeric]]) – Specifies the weight of the identity element that is added to the curvature matrix. This can be either a scalar value or a list/tuple of scalars, in which case each value specifies the weight individually for each block.
power (Scalar) – The power to which you want to raise the matrix (EstimateCurvature + identity_weight I).
exact_power (bool) – When set to True the matrix power of EstimateCurvature + identity_weight I is computed exactly. Otherwise this method might use a cheaper approximation, which may vary across different blocks.
use_cached (bool) – Whether to use a cached (and possibly stale) version of the curvature matrix estimate.
pmap_axis_name (Optional[str]) – The name of any pmap axis, which will be used for aggregating any computed values over multiple devices, as well as parallelizing the computation over devices in a block-wise fashion.
norm_to_scale_identity_weight_per_block (Optional[str]) – The name of a norm to use to compute extra per-block scaling for identity_weight. See psd_matrix_norm() in utils/math.py for the definition of these.

Return type

utils.Params

Returns

A parameter structured vector containing the product.
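As a rough illustration of the quantity computed above, the following NumPy sketch applies (C + identity_weight I)**power to a vector block by block through an eigendecomposition. It is not the library's implementation: the helper name block_matpower_multiply, the dense blocks, and the flat per-block vectors are all assumptions made for the example.

```python
import numpy as np

def block_matpower_multiply(blocks, vectors, identity_weight, power):
    """Computes (C_b + identity_weight_b * I) ** power @ v_b for every block b."""
    # A scalar identity_weight is shared by all blocks; a sequence gives one per block.
    if np.isscalar(identity_weight):
        identity_weight = [identity_weight] * len(blocks)
    results = []
    for block, vec, weight in zip(blocks, vectors, identity_weight):
        # Each block is symmetric PSD, so an eigendecomposition yields the matrix power.
        eigvals, eigvecs = np.linalg.eigh(block)
        scaled = (eigvals + weight) ** power
        results.append(eigvecs @ (scaled * (eigvecs.T @ vec)))
    return results

rng = np.random.default_rng(0)
factors = [rng.normal(size=(3, 3)), rng.normal(size=(2, 2))]
blocks = [a @ a.T for a in factors]  # hypothetical symmetric PSD curvature blocks
vectors = [rng.normal(size=3), rng.normal(size=2)]

# power=-1 applies the damped inverse of each block to its vector.
out = block_matpower_multiply(blocks, vectors, identity_weight=0.1, power=-1)
```

With power=-1 this matches solving the damped linear system for each block, which is the typical use in a preconditioned update.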

block_eigenvalues(state, use_cached)[source]

Computes the eigenvalues for each block of the curvature estimator.

Parameters

state (BlockDiagonalCurvature.State) – The state of the estimator.
use_cached (bool) – Whether to use a cached version of the eigenvalues or to compute them from the most recent curvature estimates. The cached version is at least as fresh as the last time you called update_cache() with eigenvalues=True.

Return type

Tuple[Array, …]

Returns

A tuple of arrays containing the eigenvalues for each block. The order of this tuple corresponds to the ordering of self.blocks. To understand which parameters correspond to which block you can call self.parameters_block_index.

eigenvalues(state, use_cached)[source]

Computes the eigenvalues of the curvature matrix.

Parameters

state (BlockDiagonalCurvature.State) – The state of the estimator.
use_cached (bool) – Whether to use a cached version of the eigenvalues or to compute them from the most recent curvature estimates. The cached version is at least as fresh as the last time you called update_cache() with eigenvalues=True.

Return type

Array

Returns

A single array containing the eigenvalues of the curvature matrix.
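The relationship between the per-block and the full eigenvalues can be sketched in NumPy: the spectrum of a block-diagonal matrix is the union of the spectra of its blocks. The blocks below are made-up examples, and concatenation in block order is an assumption about how a flat array could be assembled, not a statement about the library's output ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
factors = [rng.normal(size=(3, 3)), rng.normal(size=(2, 2))]
blocks = [a @ a.T for a in factors]  # hypothetical symmetric PSD blocks

# One array of eigenvalues per block, in block order.
per_block = tuple(np.linalg.eigvalsh(b) for b in blocks)

# A single flat array containing all eigenvalues.
all_eigs = np.concatenate(per_block)

# Assemble the dense block-diagonal matrix to compare against.
dense = np.zeros((5, 5))
dense[:3, :3] = blocks[0]
dense[3:, 3:] = blocks[1]
dense_eigs = np.linalg.eigvalsh(dense)  # ascending order
```

Up to ordering, all_eigs and dense_eigs contain the same values.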

update_curvature_matrix_estimate(state, ema_old, ema_new, batch_size, rng, func_args, estimation_mode=None)[source]

Updates the estimator’s curvature estimates.

Parameters

state (BlockDiagonalCurvature.State) – The state of the estimator to update.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size.
rng (PRNGKey) – A PRNGKey to be used for any potential sampling in the estimation process.
func_args (utils.FuncArgs) – A structure with the values of the inputs to the traced function (the tagged_func passed into the constructor), which will be used for the estimation process. Should have the same structure as the argument func_args passed in the constructor.
estimation_mode (Optional[str]) –
The type of curvature estimator to use. By default (i.e. if None) will use self.default_estimation_mode. One of:
- fisher_gradients - the basic estimation approach from the original K-FAC paper.
- fisher_curvature_prop - method which estimates the Fisher using self-products of random 1/-1 vectors times “half-factors” of the Fisher, as described here.
- fisher_exact - is the obvious generalization of Curvature Propagation to compute the exact Fisher (modulo any additional diagonal or Kronecker approximations) by looping over one-hot vectors for each coordinate of the output instead of using 1/-1 vectors. It is more expensive to compute than the other three options by a factor equal to the output dimension, roughly speaking.
- fisher_empirical - computes the ‘empirical’ Fisher information matrix (which uses the data’s distribution for the targets, as opposed to the true Fisher which uses the model’s distribution) and requires that each registered loss have specified targets.
- ggn_curvature_prop - Analogous to fisher_curvature_prop, but estimates the Generalized Gauss-Newton matrix (GGN).
- ggn_exact - Analogous to fisher_exact, but estimates the Generalized Gauss-Newton matrix (GGN).

Return type

Returns

The updated state.
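The role of ema_old and ema_new can be sketched as a weighted moving average. The update rule below (old value times ema_old plus new value times ema_new) is inferred from the parameter descriptions; the function ema_update and the toy matrices are hypothetical, and the library may apply additional normalization.

```python
import numpy as np

def ema_update(old_estimate, new_estimate, ema_old, ema_new):
    # Weighted combination of the running estimate and the fresh estimate.
    return ema_old * old_estimate + ema_new * new_estimate

estimate = np.zeros((2, 2))
fresh_estimates = [np.eye(2), 2.0 * np.eye(2), 3.0 * np.eye(2)]
for fresh in fresh_estimates:
    estimate = ema_update(estimate, fresh, ema_old=0.9, ema_new=0.1)
```

Setting ema_old=0.0 and ema_new=1.0 would simply overwrite the running estimate with the newest one.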

update_cache(state, identity_weight, exact_powers, approx_powers, eigenvalues, pmap_axis_name, norm_to_scale_identity_weight_per_block=None)[source]

Updates the estimator's cached values.

Parameters

state (BlockDiagonalCurvature.State) – The state of the estimator to update.
identity_weight (Union[Numeric, Sequence[Numeric]]) – Specifies the weight of the identity element that is added to the curvature matrix. This can be either a scalar value or a list/tuple of scalars, in which case each value specifies the weight individually for each block.
exact_powers (Optional[curvature_blocks.ScalarOrSequence]) – Specifies which exact matrix powers in the cache should be updated.
approx_powers (Optional[curvature_blocks.ScalarOrSequence]) – Specifies which approximate matrix powers in the cache should be updated.
eigenvalues (bool) – Specifies whether to update the cached eigenvalues of each block. If they have not been cached before, this will create an entry with them in the block’s cache.
pmap_axis_name (Optional[str]) – The name of any pmap axis, which will be used for aggregating any computed values over multiple devices, as well as parallelizing the computation over devices in a block-wise fashion.

Return type

Returns

The updated state.

to_diagonal_block_dense_matrix(state)[source]

Returns a tuple of arrays with explicit dense matrices of each block.

Return type: Tuple[Array, …]

to_dense_matrix(state)[source]

Returns an explicit dense array representing the curvature matrix.

Return type: Array

ExplicitExactCurvature

class kfac_jax.ExplicitExactCurvature(func, batch_index=1, default_estimation_mode=None, layer_tag_to_block_ctor=None, auto_register_tags=False, param_order=None, **kwargs)[source]

Explicit exact full curvature estimator class.

This class estimates the full curvature matrix by looping over the batch dimension of the input data and for each single example computes an estimate of the curvature matrix and then averages over all examples in the input data. This implies that the computation scales linearly (without parallelism) with the batch size. The class stores the estimated curvature as a dense matrix, hence its memory requirement is (number of parameters)^2. If estimation_mode is fisher_exact or ggn_exact then this would compute the exact curvature, but other modes are also supported. As a result of looping over the input data this class needs to know the index of the batch in the arguments to the model function and additionally, since the loop is achieved through indexing, each array leaf of that argument must have the same first dimension size, which will be interpreted as the batch size.
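The looping strategy described above can be sketched with NumPy for a toy linear model with scalar output x @ w, where the per-example curvature is taken to be the outer product of the per-example Jacobian (a GGN with identity loss Hessian). The model, shapes, and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_params = 8, 3
inputs = rng.normal(size=(batch_size, num_params))  # first dimension is the batch

# Dense storage: (number of parameters)^2 entries.
curvature = np.zeros((num_params, num_params))
for i in range(batch_size):  # cost grows linearly with the batch size
    # Jacobian of the scalar output x_i @ w with respect to w is x_i itself.
    jac_i = inputs[i][None, :]
    # Per-example curvature estimate.
    curvature += jac_i.T @ jac_i
curvature /= batch_size  # average over all examples
```

The loop indexes the first dimension of the batch argument, which is why every array leaf of that argument must share the same leading size.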

__init__(func, batch_index=1, default_estimation_mode=None, layer_tag_to_block_ctor=None, auto_register_tags=False, param_order=None, **kwargs)[source]

Initializes the curvature instance.

Parameters

func (utils.Func) – The model function, which should have at least one registered loss.
batch_index (int) – The index, among the inputs to func, of the batch, i.e. the data over which the curvature is averaged.
default_estimation_mode (Optional[str]) – The estimation mode to use by default when calling self.update_curvature_matrix_estimate. If None this will be 'fisher_exact'.
layer_tag_to_block_ctor (Optional[Mapping[str, CurvatureBlockCtor]]) – An optional dict mapping tags to specific classes of block approximations, which override the default ones.
auto_register_tags (bool) – This argument will be ignored since this subclass doesn’t use automatic registration.
param_order (Optional[Tuple[int]]) – An optional tuple of ints specifying the order of parameters (with the reference order being the one used by func). If not specified, the reference order is used. The parameter order will determine the order of blocks returned by to_diagonal_block_dense_matrix, and the order of the rows and columns of to_dense_matrix.
**kwargs (Any) – Additional keyword arguments passed to the superclass BlockDiagonalCurvature.

update_curvature_matrix_estimate(state, ema_old, ema_new, batch_size, rng, func_args, estimation_mode=None)[source]

Updates the estimator’s curvature estimates.

Parameters

state (BlockDiagonalCurvature.State) – The state of the estimator to update.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size.
rng (PRNGKey) – A PRNGKey to be used for any potential sampling in the estimation process.
func_args (utils.FuncArgs) – A structure with the values of the inputs to the traced function (the tagged_func passed into the constructor), which will be used for the estimation process. Should have the same structure as the argument func_args passed in the constructor.
estimation_mode (Optional[str]) –
The type of curvature estimator to use. By default (i.e. if None) will use self.default_estimation_mode. One of:
- fisher_gradients - the basic estimation approach from the original K-FAC paper.
- fisher_curvature_prop - method which estimates the Fisher using self-products of random 1/-1 vectors times “half-factors” of the Fisher, as described here.
- fisher_exact - is the obvious generalization of Curvature Propagation to compute the exact Fisher (modulo any additional diagonal or Kronecker approximations) by looping over one-hot vectors for each coordinate of the output instead of using 1/-1 vectors. It is more expensive to compute than the other three options by a factor equal to the output dimension, roughly speaking.
- fisher_empirical - computes the ‘empirical’ Fisher information matrix (which uses the data’s distribution for the targets, as opposed to the true Fisher which uses the model’s distribution) and requires that each registered loss have specified targets.
- ggn_curvature_prop - Analogous to fisher_curvature_prop, but estimates the Generalized Gauss-Newton matrix (GGN).
- ggn_exact - Analogous to fisher_exact, but estimates the Generalized Gauss-Newton matrix (GGN).

Return type

Returns

The updated state.
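The difference between the exact and the curvature-propagation style modes listed above can be sketched in NumPy. Given a "half-factor" B with F = B B^T, looping over one-hot vectors recovers F exactly, while averaging self-products of B z over random 1/-1 vectors z gives an unbiased estimate. The matrix half_factor and all sizes are made up for the example; this is not the library's code path.

```python
import numpy as np

rng = np.random.default_rng(0)
num_params, output_dim = 4, 3
half_factor = rng.normal(size=(num_params, output_dim))  # B, with F = B B^T
fisher = half_factor @ half_factor.T

# Exact style: one pass per output coordinate, using one-hot vectors.
exact = np.zeros((num_params, num_params))
for k in range(output_dim):
    one_hot = np.zeros(output_dim)
    one_hot[k] = 1.0
    column = half_factor @ one_hot
    exact += np.outer(column, column)

# Curvature-propagation style: average self-products over random +/-1 vectors.
num_samples = 20000
signs = rng.choice([-1.0, 1.0], size=(num_samples, output_dim))
projected = signs @ half_factor.T  # each row is (B z)^T
cp_estimate = projected.T @ projected / num_samples
```

The exact loop costs one pass per output coordinate, which is the source of the extra factor equal to the output dimension mentioned above.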

ImplicitExactCurvature

class kfac_jax.ImplicitExactCurvature(func, params_index=0, batch_size_extractor=<function default_batch_size_extractor>)[source]

Represents all exact curvature matrices, none of which is ever constructed explicitly.

__init__(func, params_index=0, batch_size_extractor=<function default_batch_size_extractor>)[source]

Initializes the ImplicitExactCurvature instance.

Parameters

func (utils.Func) – The model function, which should have at least one registered loss.
params_index (int) – The index of the parameters argument in the argument list of func.
batch_size_extractor (Callable[[utils.Batch], Numeric]) – A function that takes as input the function arguments and returns the batch size for a single device. (Default: kfac.utils.default_batch_size_extractor)

batch_size(func_args)[source]

The expected batch size given a list of loss instances.

Return type: Numeric

multiply_hessian(func_args, parameter_structured_vector)[source]

Multiplies the vector with the Hessian matrix of the total loss.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the Hessian matrix.
parameter_structured_vector (utils.Params) – The vector to multiply with the Hessian matrix.

Return type

utils.Params

Returns

The product Hv.

multiply_jacobian(func_args, parameter_structured_vector, return_loss_objects=False)[source]

Multiplies a vector by the model’s Jacobian.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function.
parameter_structured_vector (utils.Params) – A vector in the same structure as the parameters of the model.
return_loss_objects (bool) – If set to True will return as an additional output the loss objects evaluated at the provided function arguments.

Return type

Union[LossFunctionInputsTuple, Tuple[LossFunctionInputsTuple, LossFunctionsTuple]]

Returns

The product J v, where J is the model’s Jacobian and v is given by parameter_structured_vector.

multiply_jacobian_transpose(func_args, loss_input_vectors, return_loss_objects=False)[source]

Multiplies a vector by the model’s transposed Jacobian.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function.
loss_input_vectors (LossFunctionInputsSequence) – A sequence over losses of sequences of arrays that are the size of the loss’s inputs. This represents the vector to be multiplied.
return_loss_objects (bool) – If set to True will return as an additional output the loss objects evaluated at the provided function arguments.

Return type

Union[utils.Params, Tuple[utils.Params, LossFunctionsTuple]]

Returns

The product J^T v, where J is the model’s Jacobian and v is given by loss_input_vectors.
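The two Jacobian products are adjoint to each other, which gives a convenient sanity check: the inner product of u with J v equals the inner product of J^T u with v. The NumPy sketch below uses an explicit matrix as a stand-in for J; the local functions only mirror the method names and are not the library's methods.

```python
import numpy as np

rng = np.random.default_rng(0)
num_outputs, num_params = 3, 5
jacobian = rng.normal(size=(num_outputs, num_params))  # explicit stand-in for J

def multiply_jacobian(v):
    # Parameter space -> loss-input space.
    return jacobian @ v

def multiply_jacobian_transpose(u):
    # Loss-input space -> parameter space.
    return jacobian.T @ u

v = rng.normal(size=num_params)   # parameter-shaped vector
u = rng.normal(size=num_outputs)  # loss-input-shaped vector

lhs = u @ multiply_jacobian(v)
rhs = multiply_jacobian_transpose(u) @ v
```

In an implicit setting J is never materialized, so a check of this kind is one of the few cheap ways to validate both products together.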

multiply_fisher(func_args, parameter_structured_vector)[source]

Multiplies the vector with the Fisher matrix of the total loss.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the Fisher matrix.
parameter_structured_vector (utils.Params) – The vector to multiply with the Fisher matrix.

Return type

utils.Params

Returns

The product Fv.

multiply_ggn(func_args, parameter_structured_vector)[source]

Multiplies the vector with the GGN matrix of the total loss.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the GGN matrix.
parameter_structured_vector (utils.Params) – The vector to multiply with the GGN matrix.

Return type

utils.Params

Returns

The product Gv.
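A matrix-free GGN product is commonly composed of three steps: G v = J^T H_L (J v), where H_L is the Hessian of the loss with respect to its inputs. The NumPy sketch below assumes this standard definition and a softmax cross-entropy loss; the explicit Jacobian and the local function multiply_ggn are illustrative stand-ins, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_params = 3, 4
jacobian = rng.normal(size=(num_classes, num_params))  # J: d(logits)/d(params)
logits = rng.normal(size=num_classes)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Hessian of softmax cross-entropy with respect to the logits.
loss_hessian = np.diag(probs) - np.outer(probs, probs)

def multiply_ggn(v):
    jv = jacobian @ v        # forward product: J v
    hjv = loss_hessian @ jv  # loss curvature: H_L J v
    return jacobian.T @ hjv  # backward product: J^T H_L J v

v = rng.normal(size=num_params)
gv = multiply_ggn(v)
dense_ggn = jacobian.T @ loss_hessian @ jacobian
```

The dense matrix is formed here only to check the result; the point of the implicit form is that it is never needed.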

multiply_fisher_factor_transpose(func_args, parameter_structured_vector)[source]

Multiplies the vector with the transposed factor of the Fisher matrix.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the Fisher matrix.
parameter_structured_vector (utils.Params) – The vector to multiply with the transposed Fisher factor.

Return type

Tuple[Array, …]

Returns

The product B^T v, where F = BB^T.

multiply_ggn_factor_transpose(func_args, parameter_structured_vector)[source]

Multiplies the vector with the transposed factor of the GGN matrix.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the GGN matrix.
parameter_structured_vector (utils.Params) – The vector to multiply with the transposed GGN factor.

Return type

Tuple[Array, …]

Returns

The product B^T v, where G = BB^T.

multiply_fisher_factor(func_args, loss_inner_vectors)[source]

Multiplies the vector with the factor of the Fisher matrix.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the Fisher matrix.
loss_inner_vectors (Sequence[Array]) – The vector to multiply with the Fisher factor matrix.

Return type

utils.Params

Returns

The product Bv, where F = BB^T.

multiply_ggn_factor(func_args, loss_inner_vectors)[source]

Multiplies the vector with the factor of the GGN matrix.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function, on which to evaluate the GGN matrix.
loss_inner_vectors (Sequence[Array]) – The vector to multiply with the GGN factor matrix.

Return type

utils.Params

Returns

The product Bv, where G = BB^T.
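The factor and transposed-factor products compose into the full curvature product: applying B^T and then B to a parameter-shaped vector gives G v. The NumPy sketch below assumes one natural choice of factor, B = J^T B_L, where B_L is a factor of the loss curvature (H_L = B_L B_L^T); the matrices and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_outputs, num_params = 3, 4
jacobian = rng.normal(size=(num_outputs, num_params))      # J
loss_factor = rng.normal(size=(num_outputs, num_outputs))  # B_L, with H_L = B_L B_L^T

def multiply_factor(inner_vector):
    # B u = J^T (B_L u): inner space -> parameter space.
    return jacobian.T @ (loss_factor @ inner_vector)

def multiply_factor_transpose(param_vector):
    # B^T v = B_L^T (J v): parameter space -> inner space.
    return loss_factor.T @ (jacobian @ param_vector)

v = rng.normal(size=num_params)
gv = multiply_factor(multiply_factor_transpose(v))

# Dense reference: G = J^T B_L B_L^T J.
dense = jacobian.T @ loss_factor @ loss_factor.T @ jacobian
```

Note the differing shapes: the transposed factor maps into the "inner" space, while the factor maps back to parameter space.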

get_loss_inner_vector_shapes_and_batch_size(func_args, mode)[source]

Get shapes of loss inner vectors, and the batch size.

Parameters

func_args (utils.FuncArgs) – The inputs to the model function.
mode (str) – A string representing the type of curvature matrix for the loss inner vectors. Can be “fisher” or “ggn”.

Return type

Tuple[Tuple[Shape, …], int]

Returns

Shapes of loss inner vectors in a tuple, and the batch size as an int.

get_loss_input_shapes_and_batch_size(func_args)[source]

Get shapes of loss input vectors, and the batch size.

Parameters: func_args (utils.FuncArgs) – The inputs to the model function.
Return type: Tuple[Tuple[Tuple[Shape, …], …], int]
Returns: A tuple over losses of tuples containing the shapes of their different inputs, and the batch size (as an int).

set_default_tag_to_block_ctor

kfac_jax.set_default_tag_to_block_ctor(tag_name, block_ctor)[source]

Sets the default curvature block constructor for the given tag.

Return type: None

get_default_tag_to_block_ctor

kfac_jax.get_default_tag_to_block_ctor(tag_name)[source]

Returns the default curvature block constructor for the given tag name.

Return type: Optional[CurvatureBlockCtor]

Loss Functions

`LossFunction`(weight)	Abstract base class for loss functions.
`NegativeLogProbLoss`(weight)	Base class for loss functions that represent negative log-probability.
`DistributionNegativeLogProbLoss`(weight)	Negative log-probability loss that uses a Distrax distribution.
`NormalMeanNegativeLogProbLoss`(mean[, ...])	Negative log prob loss for a normal distribution parameterized by a mean vector.
`NormalMeanVarianceNegativeLogProbLoss`(mean, ...)	Negative log prob loss for a normal distribution with mean and variance.
`MultiBernoulliNegativeLogProbLoss`(logits[, ...])	Negative log prob loss for multiple Bernoulli distributions parametrized by logits.
`CategoricalLogitsNegativeLogProbLoss`(logits)	Negative log prob loss for a categorical distribution parameterized by logits.
`OneHotCategoricalLogitsNegativeLogProbLoss`(logits)	Neg log prob loss for a categorical distribution with onehot targets.
`register_sigmoid_cross_entropy_loss`(logits)	Registers a sigmoid cross-entropy loss function.
`register_multi_bernoulli_predictive_distribution`(logits)	Registers a multi-Bernoulli predictive distribution.
`register_softmax_cross_entropy_loss`(logits)	Registers a softmax cross-entropy loss function.
`register_categorical_predictive_distribution`(logits)	Registers a categorical predictive distribution.
`register_squared_error_loss`(prediction[, ...])	Registers a squared error loss function.
`register_normal_predictive_distribution`(mean)	Registers a normal predictive distribution.

LossFunction

class kfac_jax.LossFunction(weight)[source]

Abstract base class for loss functions.

Note that, unlike typical loss functions used in neural networks, these are neither summed nor averaged over the batch, and the output of evaluate() will not be a scalar. It is then up to the user to manipulate them correctly as needed.

__init__(weight)[source]

Initializes the loss instance.

Parameters: weight (Numeric) – The relative weight attributed to the loss.

property weight: Numeric

The relative weight of the loss.

Return type: Numeric

abstract property targets: Optional[Array]

The targets (if present) used for evaluating the loss.

Return type: Optional[Array]

abstract property parameter_dependants: Tuple[Array, ...]

All the parameter dependent arrays of the loss.

Return type: Tuple[Array, …]

property num_parameter_dependants: int

Number of parameter dependent arrays of the loss.

Return type: int

abstract property parameter_independants: Tuple[Numeric, ...]

All the parameter independent arrays of the loss.

Return type: Tuple[Numeric, …]

property num_parameter_independants: int

Number of parameter independent arrays of the loss.

Return type: int

copy_with_different_inputs(parameter_dependants)[source]

Creates a copy of the loss function object, but with different inputs.

Return type: LossFunction

evaluate(targets=None, coefficient_mode='regular')[source]

Evaluates the loss function on the targets.

Parameters

targets (Optional[Array]) – The targets, on which to evaluate the loss. If this is set to None will use self.targets instead.
coefficient_mode (str) –
Specifies how to use the relative weight of the loss in the returned value. There are three options:
1. 'regular' - returns self.weight * loss(targets)
2. 'sqrt' - returns sqrt(self.weight) * loss(targets)
3. 'off' - returns loss(targets)

Return type

Array

Returns

The value of the loss scaled appropriately by self.weight according to the coefficient mode.

Raises

ValueError – If both targets and self.targets are None.
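The three coefficient modes can be sketched as a small pure-Python helper. The function apply_coefficient_mode is hypothetical and only mirrors the scaling rules listed above; it is not part of the library.

```python
import math

def apply_coefficient_mode(unweighted_loss, weight, coefficient_mode="regular"):
    """Scales an unweighted loss value according to the coefficient mode."""
    if coefficient_mode == "regular":
        return weight * unweighted_loss
    if coefficient_mode == "sqrt":
        return math.sqrt(weight) * unweighted_loss
    if coefficient_mode == "off":
        return unweighted_loss
    raise ValueError(f"Unrecognized coefficient_mode={coefficient_mode!r}")

per_example_loss = 2.0
regular = apply_coefficient_mode(per_example_loss, weight=4.0)
sqrt_mode = apply_coefficient_mode(per_example_loss, 4.0, "sqrt")
off = apply_coefficient_mode(per_example_loss, 4.0, "off")
```

With a weight of 4.0 and an unweighted loss of 2.0, the three modes give 8.0, 4.0 and 2.0 respectively.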

grad_of_evaluate(targets, coefficient_mode)[source]

Evaluates the gradient of the loss function, w.r.t. its inputs.

Parameters

targets (Optional[Array]) – The targets at which to evaluate the loss. If this is None will use self.targets instead.
coefficient_mode (str) – The coefficient mode to use for evaluation. See self.evaluate for more details.

Return type

Tuple[Array, …]

Returns

The gradient of the loss function w.r.t. its inputs, at the provided targets.

multiply_ggn(vector)[source]

Right-multiplies a vector by the GGN of the loss function.

Here the GGN is the Generalized Gauss-Newton matrix (whose definition is somewhat flexible) of the loss function with respect to its inputs.

Parameters: vector (Sequence[Array]) – The vector to multiply. Must have the same shape(s) as self.inputs.
Return type: Tuple[Array, …]
Returns: The vector right-multiplied by the GGN. Will have the same shape(s) as self.inputs.

abstract multiply_ggn_unweighted(vector)[source]

Unweighted version of multiply_ggn().

Return type: Tuple[Array, …]

multiply_ggn_factor(vector)[source]

Right-multiplies a vector by a factor B of the GGN.

Here the GGN is the Generalized Gauss-Newton matrix (whose definition is somewhat flexible) of the loss function with respect to its inputs. Typically this will be block-diagonal across different cases in the batch, since the loss function is typically summed across cases.

Note that B can be any matrix satisfying B * B^T = G where G is the GGN, but will agree with the one used in the other methods of this class.

Parameters: vector (Array) – The vector to multiply. Must be of the shape(s) given by ‘self.ggn_factor_inner_shape’.
Return type: Tuple[Array, …]
Returns: The vector right-multiplied by B. Will be of the same shape(s) as self.inputs.

abstract multiply_ggn_factor_unweighted(vector)[source]

Unweighted version of multiply_ggn_factor().

Return type: Tuple[Array, …]

multiply_ggn_factor_transpose(vector)[source]

Right-multiplies a vector by the transpose of a factor B of the GGN.

Here the GGN is the Generalized Gauss-Newton matrix (whose definition is somewhat flexible) of the loss function with respect to its inputs. Typically this will be block-diagonal across different cases in the batch, since the loss function is typically summed across cases.

Note that B can be any matrix satisfying B * B^T = G where G is the GGN, but will agree with the one used in the other methods of this class.

Parameters: vector (Sequence[Array]) – The vector to multiply. Must have the same shape(s) as self.inputs.
Return type: Array
Returns: The vector right-multiplied by B^T. Will be of the shape(s) given by self.ggn_factor_inner_shape.

abstract multiply_ggn_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_ggn_factor_transpose().

Return type: Array

multiply_ggn_factor_replicated_one_hot(index)[source]

Right-multiplies a replicated-one-hot vector by a factor B of the GGN.

Here the GGN is the Generalized Gauss-Newton matrix (whose definition is somewhat flexible) of the loss function with respect to its inputs. Typically this will be block-diagonal across different cases in the batch, since the loss function is typically summed across cases.

A replicated-one-hot vector means a tensor which, for each slice along the batch dimension (assumed to be dimension 0), is 1.0 in the entry corresponding to the given index and 0 elsewhere.

Note that B can be any matrix satisfying B * B^T = G where G is the GGN, but will agree with the one used in the other methods of this class.

Parameters: index (Sequence[int]) – A tuple representing the index of the entry in each slice that is 1.0. Note that len(index) must be equal to the number of elements of the ggn_factor_inner_shape tensor minus one.
Return type: Tuple[Array, …]
Returns: The vector right-multiplied by B. Will be of the same shape(s) as the inputs property.
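A replicated-one-hot vector, as defined above, can be constructed explicitly with NumPy. In this sketch the helper replicated_one_hot is hypothetical, and its inner_shape argument excludes the batch dimension, so the full shape has len(index) + 1 entries, consistent with the constraint on len(index).

```python
import numpy as np

def replicated_one_hot(batch_size, inner_shape, index):
    """Array of shape (batch_size, *inner_shape) with 1.0 at `index` in every batch slice."""
    if len(index) != len(inner_shape):
        raise ValueError("index must have one entry per non-batch dimension")
    out = np.zeros((batch_size, *inner_shape))
    # Select every slice along the batch dimension (dimension 0) and set the chosen entry.
    out[(slice(None), *index)] = 1.0
    return out

vec = replicated_one_hot(batch_size=4, inner_shape=(3,), index=(1,))
```

Multiplying the factor B by such a vector for every possible index is what allows an exact curvature to be assembled one coordinate at a time.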

abstract multiply_ggn_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_ggn_factor_replicated_one_hot().

Return type: Tuple[Array, …]

abstract property ggn_factor_inner_shape: Shape

The shape of the array returned by self.multiply_ggn_factor_transpose (and expected as input by self.multiply_ggn_factor).

Return type: Shape

NegativeLogProbLoss

class kfac_jax.NegativeLogProbLoss(weight)[source]

Base class for loss functions that represent negative log-probability.

property parameter_dependants: Tuple[Array, ...]

All the parameter dependent arrays of the loss.

Return type: Tuple[Array, …]

abstract property params: Tuple[Array, ...]

Parameters to the underlying distribution.

Return type: Tuple[Array, …]

multiply_fisher(vector)[source]

Right-multiplies a vector by the Fisher.

Parameters: vector (Sequence[Array]) – The vector to multiply. Must have the same shape(s) as self.inputs.
Return type: Tuple[Array, …]
Returns: The vector right-multiplied by the Fisher. Will have the same shape(s) as self.inputs.

abstract multiply_fisher_unweighted(vector)[source]

Unweighted version of multiply_fisher().

Return type: Tuple[Array, …]

multiply_fisher_factor(vector)[source]

Right-multiplies a vector by a factor B of the Fisher.

Here the Fisher is the Fisher information matrix (i.e. expected outer-product of gradients) with respect to the parameters of the underlying probability distribution (whose log-prob defines the loss). Typically this will be block-diagonal across different cases in the batch, since the distribution is usually (but not always) conditionally iid across different cases.

Note that B can be any matrix satisfying B * B^T = F where F is the Fisher, but will agree with the one used in the other methods of this class.

Parameters: vector (Array) – The vector to multiply. Must have the same shape(s) as self.fisher_factor_inner_shape.
Return type: Tuple[Array, …]
Returns: The vector right-multiplied by B. Will have the same shape(s) as self.inputs.

abstract multiply_fisher_factor_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor().

Return type: Tuple[Array, …]

multiply_fisher_factor_transpose(vector)[source]

Right-multiplies a vector by the transpose of a factor B of the Fisher.

Here the Fisher is the Fisher information matrix (i.e. expected outer-product of gradients) with respect to the parameters of the underlying probability distribution (whose log-prob defines the loss). Typically this will be block-diagonal across different cases in the batch, since the distribution is usually (but not always) conditionally iid across different cases.

Note that B can be any matrix satisfying B * B^T = F where F is the Fisher, but will agree with the one used in the other methods of this class.

Parameters: vector (Sequence[Array]) – The vector to multiply. Must have the same shape(s) as self.inputs.
Return type: Array
Returns: The vector right-multiplied by B^T. Will have the shape given by self.fisher_factor_inner_shape.

abstract multiply_fisher_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor_transpose().

Return type: Array

multiply_fisher_factor_replicated_one_hot(index)[source]

Right-multiplies a replicated-one-hot vector by a factor B of the Fisher.

Here the Fisher is the Fisher information matrix (i.e. expected outer-product of gradients) with respect to the parameters of the underlying probability distribution (whose log-prob defines the loss). Typically this will be block-diagonal across different cases in the batch, since the distribution is usually (but not always) conditionally iid across different cases.

A replicated-one-hot vector means a tensor which, for each slice along the batch dimension (assumed to be dimension 0), is 1.0 in the entry corresponding to the given index and 0 elsewhere.

Note that B can be any matrix satisfying B * B^T = F where F is the Fisher, but will agree with the one used in the other methods of this class.

Parameters: index (Sequence[int]) – A tuple representing the index of the entry in each slice that is 1.0. Note that len(index) must be equal to the number of elements of the fisher_factor_inner_shape tensor minus one.
Return type: Tuple[Array, …]
Returns: The vector right-multiplied by B. Will have the same shape(s) as self.inputs.

abstract multiply_fisher_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_fisher_factor_replicated_one_hot().

Return type: Tuple[Array, …]

abstract property fisher_factor_inner_shape: Shape

The shape of the array returned by multiply_fisher_factor_transpose() (and expected as input by multiply_fisher_factor()).

Return type: Shape

abstract sample(rng)[source]

Sample targets from the underlying distribution.

Return type: Array

grad_of_evaluate_on_sample(rng, coefficient_mode)[source]

Evaluates the gradient of the log probability on a random sample.

Parameters

rng (Array) – JAX PRNG key for sampling.
coefficient_mode (str) – The coefficient mode to use for evaluation.

Return type

Tuple[Array, …]

Returns

The gradient of the log probability of targets sampled from the distribution.

DistributionNegativeLogProbLoss

class kfac_jax.DistributionNegativeLogProbLoss(weight)[source]

Negative log-probability loss that uses a Distrax distribution.

abstract property dist: distrax.Distribution

The underlying Distrax distribution.

Return type: distrax.Distribution

sample(rng)[source]

Sample targets from the underlying distribution.

Return type: Array

property fisher_factor_inner_shape: Shape

The shape of the array returned by multiply_fisher_factor_transpose() (and expected as input by multiply_fisher_factor()).

Return type: Shape

NormalMeanNegativeLogProbLoss

class kfac_jax.NormalMeanNegativeLogProbLoss(mean, targets=None, variance=0.5, weight=1.0, normalize_log_prob=True)[source]

Negative log prob loss for a normal distribution parameterized by a mean vector.

Note that, by default, the covariance is treated as the identity divided by 2 (variance=0.5). Also note that the Fisher for such a normal distribution with respect to the mean parameter is given by:

F = (1 / variance) * I

See for example https://www.ii.pwr.edu.pl/~tomczak/PDF/[JMT]Fisher_inf.pdf.
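The formula F = (1 / variance) * I makes the unweighted Fisher products trivial to write down. The NumPy sketch below uses local functions that mirror the method names but ignore the loss weight; choosing B = I / sqrt(variance) as the factor is an assumption, since any B with B B^T = F would be valid.

```python
import numpy as np

variance = 0.5
dim = 3
fisher = np.eye(dim) / variance           # F = (1 / variance) * I
factor = np.eye(dim) / np.sqrt(variance)  # one valid B with B B^T = F

def multiply_fisher(v):
    # F v, without ever forming F.
    return v / variance

def multiply_fisher_factor(u):
    # B u for the symmetric factor chosen above (so B^T u is the same).
    return u / np.sqrt(variance)

v = np.array([1.0, -2.0, 0.5])
fv = multiply_fisher(v)
```

Because this factor is symmetric, applying multiply_fisher_factor twice reproduces multiply_fisher.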

__init__(mean, targets=None, variance=0.5, weight=1.0, normalize_log_prob=True)[source]

Initializes the loss instance.

Parameters

mean (Array) – The mean of the normal distribution.
targets (Optional[Array]) – Optional targets to use for evaluation.
variance (Numeric) – The scalar variance of the normal distribution.
weight (Numeric) – The relative weight of the loss.
normalize_log_prob (bool) – Whether the log prob should include the standard normalization constant for Gaussians (which is additive and depends on the variance).

property targets: Optional[Array]

The targets (if present) used for evaluating the loss.

Return type: Optional[Array]

property parameter_independants: Tuple[Numeric, ...]

All the parameter independent arrays of the loss.

Return type: Tuple[Numeric, …]

property dist: distrax.MultivariateNormalDiag

The underlying Distrax distribution.

Return type: distrax.MultivariateNormalDiag

property params: Tuple[Array]

Parameters to the underlying distribution.

Return type: Tuple[Array]

multiply_fisher_unweighted(vector)[source]

Unweighted version of multiply_fisher().

Return type: Tuple[Array]

multiply_fisher_factor_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor().

Return type: Tuple[Array]

multiply_fisher_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor_transpose().

Return type: Array

multiply_fisher_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_fisher_factor_replicated_one_hot().

Return type: Tuple[Array]

NormalMeanVarianceNegativeLogProbLoss

class kfac_jax.NormalMeanVarianceNegativeLogProbLoss(mean, variance, targets=None, weight=1.0)[source]

Negative log prob loss for a normal distribution with mean and variance.

This class parameterizes a multivariate normal distribution with n independent dimensions. Unlike NormalMeanNegativeLogProbLoss, this class does not assume the variance is held constant. The Fisher Information for n = 1 is given by:

F = [[1 / variance, 0],
     [0, 0.5 / variance^2]]

where the parameters of the distribution are concatenated into a single vector as [mean, variance]. For n > 1, the mean parameter vector is concatenated with the variance parameter vector. For further details check out the Wikipedia page.
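As an illustration of the n = 1 case, the Fisher can be estimated as the average outer product of the score over samples drawn from the model. This is a plain-Python sketch under that definition, not the library's implementation; score and monte_carlo_fisher are hypothetical names.

```python
import random


def score(target, mean, variance):
    """Gradient of the log probability with respect to (mean, variance)."""
    diff = target - mean
    d_mean = diff / variance
    d_variance = -0.5 / variance + 0.5 * diff ** 2 / variance ** 2
    return d_mean, d_variance


def monte_carlo_fisher(mean, variance, num_samples=200_000, seed=0):
    """Averages the outer product of the score over model samples."""
    rng = random.Random(seed)
    std = variance ** 0.5
    f_mm = f_mv = f_vv = 0.0
    for _ in range(num_samples):
        g_m, g_v = score(rng.gauss(mean, std), mean, variance)
        f_mm += g_m * g_m
        f_mv += g_m * g_v
        f_vv += g_v * g_v
    n = float(num_samples)
    return [[f_mm / n, f_mv / n], [f_mv / n, f_vv / n]]


mean, variance = 0.3, 2.0
fisher = monte_carlo_fisher(mean, variance)
expected = [[1.0 / variance, 0.0], [0.0, 0.5 / variance ** 2]]
```

Up to sampling noise, fisher should agree with expected entry by entry.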

__init__(mean, variance, targets=None, weight=1.0)[source]

Initializes the loss instance.

Parameters

mean (Array) – The mean of the normal distribution.
variance (Array) – The variance of the normal distribution.
targets (Optional[Array]) – Optional targets to use for evaluation.
weight (Numeric) – The relative weight of the loss.

property targets: Optional[Array]

The targets (if present) used for evaluating the loss.

Return type: Optional[Array]

property parameter_independants: Tuple[Numeric, ...]

All the parameter independent arrays of the loss.

Return type: Tuple[Numeric, …]

property dist: distrax.MultivariateNormalDiag

The underlying Distrax distribution.

Return type: distrax.MultivariateNormalDiag

property params: Tuple[Array, Array]

Parameters to the underlying distribution.

Return type: Tuple[Array, Array]

multiply_fisher_unweighted(vector)[source]

Unweighted version of multiply_fisher().

Return type: Tuple[Array, Array]

multiply_fisher_factor_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor().

Return type: Tuple[Array, Array]

multiply_fisher_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor_transpose().

Return type: Array

multiply_fisher_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_fisher_factor_replicated_one_hot().

Return type: Tuple[Array, Array]

property fisher_factor_inner_shape: Shape

The shape of the array returned by multiply_fisher_factor().

Return type: Shape

multiply_ggn_unweighted(vector)[source]

Unweighted version of multiply_ggn().

Return type: Tuple[Array, …]

multiply_ggn_factor_unweighted(vector)[source]

Unweighted version of multiply_ggn_factor().

Return type: Tuple[Array, …]

multiply_ggn_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_ggn_factor_transpose().

Return type: Array

multiply_ggn_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_ggn_factor_replicated_one_hot().

Return type: Tuple[Array, …]

property ggn_factor_inner_shape: Shape

The shape of the array returned by multiply_ggn_factor().

Return type: Shape

MultiBernoulliNegativeLogProbLoss

class kfac_jax.MultiBernoulliNegativeLogProbLoss(logits, targets=None, weight=1.0)[source]

Negative log prob loss for multiple Bernoulli distributions parametrized by logits.

Represents N independent Bernoulli distributions where N = len(logits). Its Fisher Information matrix is given by F = diag(p * (1-p)), where p = sigmoid(logits).

As F is diagonal with positive entries, its factor B is B = diag(sqrt(p * (1-p))).
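The two formulas above can be written out directly. The sketch below uses plain Python lists to hold only the diagonals of F and B; the function names are hypothetical and are not part of kfac_jax.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_fisher_diagonal(logits):
    """Diagonal of F = diag(p * (1 - p)) with p = sigmoid(logits)."""
    return [sigmoid(z) * (1.0 - sigmoid(z)) for z in logits]


def bernoulli_fisher_factor_diagonal(logits):
    """Diagonal of B = diag(sqrt(p * (1 - p))), so that B * B^T = F."""
    return [math.sqrt(f) for f in bernoulli_fisher_diagonal(logits)]


logits = [-2.0, 0.0, 1.5]
fisher_diag = bernoulli_fisher_diagonal(logits)
factor_diag = bernoulli_fisher_factor_diagonal(logits)
```

The entry for logit 0 is 0.25, the largest value p * (1 - p) can take.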

__init__(logits, targets=None, weight=1.0)[source]

Initializes the loss instance.

Parameters

logits (Array) – The logits of the Bernoulli distribution.
targets (Optional[Array]) – Optional targets to use for evaluation.
weight (Numeric) – The relative weight of the loss.

property targets: Optional[Array]

The targets (if present) used for evaluating the loss.

Return type: Optional[Array]

property parameter_independants: Tuple[Numeric, ...]

All the parameter independent arrays of the loss.

Return type: Tuple[Numeric, …]

property dist: distrax.Bernoulli

The underlying Distrax distribution.

Return type: distrax.Bernoulli

property params: Tuple[Array]

Parameters to the underlying distribution.

Return type: Tuple[Array]

multiply_fisher_unweighted(vector)[source]

Unweighted version of multiply_fisher().

Return type: Tuple[Array]

multiply_fisher_factor_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor().

Return type: Tuple[Array]

multiply_fisher_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor_transpose().

Return type: Array

multiply_fisher_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_fisher_factor_replicated_one_hot().

Return type: Tuple[Array]

CategoricalLogitsNegativeLogProbLoss

class kfac_jax.CategoricalLogitsNegativeLogProbLoss(logits, targets=None, mask=None, weight=1.0)[source]

Negative log prob loss for a categorical distribution parameterized by logits.

Note that the Fisher (for a single case) of a categorical distribution, with respect to the natural parameters (i.e. the logits), is given by F = diag(p) - p*p^T, where p = softmax(logits). F can be factorized as F = B * B^T, where B = diag(q) - p*q^T and q is the entry-wise square root of p. This is easy to verify using the fact that q^T*q = 1.
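The factorization can also be verified numerically. The sketch below builds F and B for a single case with nested Python lists and compares B * B^T against F; all helper names are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_fisher(p):
    """F = diag(p) - p p^T."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]


def categorical_fisher_factor(p):
    """B = diag(q) - p q^T, with q the entry-wise square root of p."""
    q = [math.sqrt(x) for x in p]
    n = len(p)
    return [[(q[i] if i == j else 0.0) - p[i] * q[j] for j in range(n)]
            for i in range(n)]


def times_own_transpose(b):
    """Computes B B^T for a square matrix stored as nested lists."""
    n = len(b)
    return [[sum(b[i][k] * b[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


p = softmax([2.0, -1.0, 0.5])
fisher = categorical_fisher(p)
reconstructed = times_own_transpose(categorical_fisher_factor(p))
max_abs_error = max(abs(fisher[i][j] - reconstructed[i][j])
                    for i in range(3) for j in range(3))
```

max_abs_error should be at the level of floating point rounding.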

__init__(logits, targets=None, mask=None, weight=1.0)[source]

Initializes the loss instance.

Parameters

logits (Array) – The logits of the Categorical distribution of shape (batch_size, output_size).
targets (Optional[Array]) – Optional targets to use for evaluation, which specify an integer index of the correct class. Must be of shape (batch_size,).
mask (Optional[Array]) – Optional mask to apply to losses over the batch. Should be 0/1-valued and of shape (batch_size,). The tensors returned by evaluate and grad_of_evaluate, as well as the various matrix vector products, will be multiplied by mask (with broadcasting to later dimensions).
weight (Numeric) – The relative weight of the loss.

property targets: Optional[Array]

The targets (if present) used for evaluating the loss.

Return type: Optional[Array]

property parameter_independants: Tuple[Numeric, ...]

All the parameter independent arrays of the loss.

Return type: Tuple[Numeric, …]

property dist: distrax.Categorical

The underlying Distrax distribution.

Return type: distrax.Categorical

property params: Tuple[Array]

Parameters to the underlying distribution.

Return type: Tuple[Array]

property fisher_factor_inner_shape: Shape

The shape of the array returned by multiply_fisher_factor().

Return type: Shape

multiply_fisher_unweighted(vector)[source]

Unweighted version of multiply_fisher().

Return type: Tuple[Array]

multiply_fisher_factor_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor().

Return type: Tuple[Array]

multiply_fisher_factor_transpose_unweighted(vector)[source]

Unweighted version of multiply_fisher_factor_transpose().

Return type: Array

multiply_fisher_factor_replicated_one_hot_unweighted(index)[source]

Unweighted version of multiply_fisher_factor_replicated_one_hot().

Return type: Tuple[Array]

OneHotCategoricalLogitsNegativeLogProbLoss

class kfac_jax.OneHotCategoricalLogitsNegativeLogProbLoss(logits, targets=None, mask=None, weight=1.0)[source]

Negative log prob loss for a categorical distribution with one-hot targets.

Identical to CategoricalLogitsNegativeLogProbLoss except that the underlying distribution is OneHotCategorical as opposed to Categorical.

property dist: distrax.OneHotCategorical

The underlying Distrax distribution.

Return type: distrax.OneHotCategorical

register_sigmoid_cross_entropy_loss

kfac_jax.register_sigmoid_cross_entropy_loss(logits, targets=None, weight=1.0)[source]

Registers a sigmoid cross-entropy loss function.

This assumes a sigmoid cross-entropy loss of the form weight * jnp.sum(sigmoid_cross_entropy(logits, targets)) / batch_size.

NOTE: this function assumes you are not averaging over non-batch dimensions when computing the loss. E.g. if dimension 0 were the batch dimension, this corresponds to jnp.mean(jnp.sum(sigmoid_cross_entropy(logits, targets), axis=range(1, targets.ndim)), axis=0) and not jnp.mean(sigmoid_cross_entropy(logits, targets)). If your loss is of the latter form you can compensate for this by passing the appropriate value to weight.

NOTE: this function is distinct from register_softmax_cross_entropy_loss() and should not be confused with it. It is similar to register_multi_bernoulli_predictive_distribution() but without the explicit probabilistic interpretation. It behaves identically for now.

Parameters

logits (Array) – The input logits of the loss as a 2D array of floats. The first dimension will usually be the batch size, but doesn’t need to be (unless using estimation_mode='fisher_exact' or estimation_mode='ggn_exact' in the optimizer/estimator).
targets (Optional[Array]) – (OPTIONAL) The targets for the loss function. Must be of the same shape as logits. Only required if using estimation_mode='fisher_empirical' in the optimizer/estimator. (Default: None)
weight (Numeric) – The constant scalar coefficient which this loss is multiplied by. Note that this must be constant and independent of the network’s parameters. (Default: 1.0)
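To make the note about non-batch dimensions concrete, the sketch below compares the assumed loss form with a loss averaged over every element, using plain Python lists in place of arrays. The helper sigmoid_cross_entropy and the variable names are illustrative assumptions; the point is only that weight = 1 / (number of non-batch elements) reconciles the two forms.

```python
import math


def sigmoid_cross_entropy(logit, target):
    """Element-wise sigmoid cross-entropy in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logits = [[0.2, -1.0, 3.0], [1.5, 0.0, -0.7]]   # shape (batch_size=2, 3)
targets = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
batch_size, num_outputs = len(logits), len(logits[0])

per_element = [[sigmoid_cross_entropy(z, t) for z, t in zip(zs, ts)]
               for zs, ts in zip(logits, targets)]

# Form assumed by the registration: sum over non-batch dims, mean over batch.
assumed_loss = sum(sum(row) for row in per_element) / batch_size

# A loss that instead averages over every element.
mean_loss = sum(sum(row) for row in per_element) / (batch_size * num_outputs)

# Passing this value as `weight` makes the assumed form match `mean_loss`.
compensating_weight = 1.0 / num_outputs
```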

register_multi_bernoulli_predictive_distribution

kfac_jax.register_multi_bernoulli_predictive_distribution(logits, targets=None, weight=1.0)[source]

Registers a multi-Bernoulli predictive distribution.

This corresponds to a sigmoid cross-entropy loss of the form weight * jnp.sum(sigmoid_cross_entropy(logits, targets)) / batch_size.

NOTE: this function assumes you are not averaging over non-batch dimensions when computing the loss. E.g. if dimension 0 were the batch dimension, this corresponds to jnp.mean(jnp.sum(sigmoid_cross_entropy(logits, targets), axis=range(1, targets.ndim)), axis=0) and not jnp.mean(sigmoid_cross_entropy(logits, targets)). If your loss is of the latter form you can compensate for it by passing the appropriate value to weight.

NOTE: this is distinct from register_categorical_predictive_distribution() and should not be confused with it.

Parameters

logits (Array) – The logits of the distribution (i.e. its parameters) as a 2D array of floats. The first dimension will usually be the batch size, but doesn’t need to be (unless using estimation_mode='fisher_exact' or estimation_mode='ggn_exact' in the optimizer/estimator).
targets (Optional[Array]) – (OPTIONAL) The targets for the loss function. Only required if using estimation_mode='fisher_empirical' in the optimizer/estimator. (Default: None)
weight (Numeric) – The constant scalar coefficient that the log prob loss associated with this distribution is multiplied by. This is NOT equivalent to changing the temperature of the distribution since we don’t renormalize the log prob in the objective function. Note that this must be constant and independent of the network’s parameters. (Default: 1.0)

register_softmax_cross_entropy_loss

kfac_jax.register_softmax_cross_entropy_loss(logits, targets=None, mask=None, weight=1.0)[source]

Registers a softmax cross-entropy loss function.

This assumes a softmax cross-entropy loss of the form

weight * jnp.sum(softmax_cross_entropy(logits, targets)) / batch_size.

NOTE: this is distinct from register_sigmoid_cross_entropy_loss() and should not be confused with it. It is similar to register_categorical_predictive_distribution() but without the explicit probabilistic interpretation. It behaves identically for now.

Parameters

logits (Array) – The input logits of the loss as a 2D array of floats. The first dimension will usually be the batch size, but doesn’t need to be (unless using estimation_mode='fisher_exact' or estimation_mode='ggn_exact' in the optimizer/estimator). The second dimension is the one over which the softmax is computed.
targets (Optional[Array]) – (OPTIONAL) The targets for the loss function. Must be a 1D array of integers with shape (logits.shape[0],). Only required if using estimation_mode='fisher_empirical' in the optimizer/estimator. (Default: None)
mask (Optional[Array]) – (OPTIONAL) Mask to apply to losses. Should be 0/1-valued and of shape (logits.shape[0],). Losses corresponding to mask values of False will be treated as constant and equal to 0. (Default: None)
weight (Numeric) – The constant scalar coefficient which this loss is multiplied by. Note that this must be constant and independent of the network’s parameters. (Default: 1.0)
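The loss form and the effect of mask can be sketched as follows in plain Python. This is an illustration only: the helper names are hypothetical, and dividing by the full batch size (masked rows included) is an assumption that the text above does not state explicitly.

```python
import math


def softmax_cross_entropy(logits_row, target_index):
    """Negative log softmax probability of the class at target_index."""
    shift = max(logits_row)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits_row))
    return log_norm - logits_row[target_index]


def masked_softmax_loss(logits, targets, mask, weight=1.0):
    """weight * sum(mask * cross_entropy) / batch_size (assumed form)."""
    batch_size = len(logits)
    total = sum(m * softmax_cross_entropy(row, t)
                for row, t, m in zip(logits, targets, mask))
    return weight * total / batch_size


logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0], [-0.3, 1.2, 0.4]]
targets = [0, 2, 1]          # integer class indices, shape (batch_size,)
mask = [1.0, 0.0, 1.0]       # the second example contributes nothing
loss = masked_softmax_loss(logits, targets, mask)
```

Because the second row is masked out, changing its logits leaves loss unchanged.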

register_categorical_predictive_distribution

kfac_jax.register_categorical_predictive_distribution(logits, targets=None, mask=None, weight=1.0)[source]

Registers a categorical predictive distribution.

This corresponds to a softmax cross-entropy loss of the form

weight * jnp.sum(softmax_cross_entropy(logits, targets)) / batch_size.

NOTE: this is distinct from register_multi_bernoulli_predictive_distribution() and should not be confused with it.

Parameters

logits (Array) – The logits of the distribution (i.e. its parameters) as a 2D array of floats. The first dimension will usually be the batch size, but doesn’t need to be (unless using estimation_mode='fisher_exact' or estimation_mode='ggn_exact' in the optimizer/estimator). The second dimension is the one over which the softmax is computed.
targets (Optional[Array]) – (OPTIONAL) The values at which the log probability of this distribution is evaluated (to give the loss). Must be a 1D array of integers with shape (logits.shape[0],). Only required if using estimation_mode='fisher_empirical' in the optimizer/estimator. (Default: None)
mask (Optional[Array]) – (OPTIONAL) Mask to apply to log probabilities generated by the distribution. Should be 0/1-valued and of shape (logits.shape[0],). Log probabilities corresponding to mask values of False will be treated as constant and equal to 0. (Default: None)
weight (Numeric) – The constant scalar coefficient that the log prob loss associated with this distribution is multiplied by. This is NOT equivalent to changing the temperature of the distribution since we don’t renormalize the log prob in the objective function. Note that this must be constant and independent of the network’s parameters. (Default: 1.0)

register_squared_error_loss

kfac_jax.register_squared_error_loss(prediction, targets=None, weight=1.0)[source]

Registers a squared error loss function.

This assumes a squared error loss of the form weight * jnp.sum((targets - prediction)**2) / batch_size.

If your loss uses a coefficient of 0.5 you need to set the weight argument to reflect this.

NOTE: this function assumes you are not averaging over non-batch dimensions when computing the loss. E.g. if dimension 0 were the batch dimension, this corresponds to jnp.mean(jnp.sum((targets - prediction)**2, axis=range(1, targets.ndim)), axis=0) and not jnp.mean((targets - prediction)**2). If your loss is of the latter form you can compensate for it by passing the appropriate value to weight.

NOTE: even though prediction and targets are interchangeable in the definition of the squared error loss, they are not interchangeable in this function. prediction must be the output of your parameterized function (e.g. neural network), and targets must not depend on the parameters. Mixing the two up could lead to a silent failure of the curvature estimation.

Parameters

prediction (Array) – The prediction made by the network (i.e. its output). The first dimension will usually be the batch size, but doesn’t need to be (unless using estimation_mode='fisher_exact' or estimation_mode='ggn_exact' in the optimizer/estimator).
targets (Optional[Array]) – (OPTIONAL) The targets for the loss function. Only required if using estimation_mode='fisher_empirical' in the optimizer/estimator. (Default: None)
weight (Numeric) – The constant scalar coefficient which this loss is multiplied by. Note that this must be constant and independent of the network’s parameters. (Default: 1.0)

register_normal_predictive_distribution

kfac_jax.register_normal_predictive_distribution(mean, targets=None, variance=0.5, weight=1.0, normalize_log_prob=True)[source]

Registers a normal predictive distribution.

This corresponds to a squared error loss of the form: weight/(2*var) * jnp.sum((targets - mean)**2) / batch_size.

NOTE: this function assumes you are not averaging over non-batch dimensions when computing the loss. E.g. if dimension 0 were the batch dimension, this corresponds to jnp.mean(jnp.sum((targets - prediction)**2, axis=range(1, targets.ndim)), axis=0) and not jnp.mean((targets - prediction)**2). If your loss is of the latter form you can compensate for it by passing the appropriate value to weight.

Parameters

mean (Array) – A tensor defining the mean vector of the distribution. The first dimension will usually be the batch size, but doesn’t need to be (unless using estimation_mode='fisher_exact' or estimation_mode='ggn_exact' in the optimizer/estimator).
targets (Optional[Array]) – (OPTIONAL) The targets for the loss function. Only required if using estimation_mode='fisher_empirical' in the optimizer/estimator. (Default: None)
variance (float) – The variance of the distribution. Must be a constant scalar, independent of the network’s parameters. Note that the default value of 0.5 corresponds to a standard squared error loss weight * jnp.sum((target - prediction)**2). If you want your squared error loss to be of the form 0.5*coeff*jnp.sum((target - prediction)**2) you should use variance=1.0. (Default: 0.5)
weight (Numeric) – A constant scalar coefficient that the log prob loss associated with this distribution is multiplied by. In general this is NOT equivalent to changing the temperature of the distribution, but in the case of normal distributions it may be. Note that this must be constant and independent of the network’s parameters. (Default: 1.0)
normalize_log_prob (bool) – Whether the negative log prob loss associated with this distribution should include the additive normalization constant (which is constant and depends on variance) that makes it a true log prob, and not just a squared error loss. Note that this has no effect on the behavior of the optimizer, except in niche situations where the loss value is computed from the registrations, e.g. when include_registered_loss_in_stats=True is used. (Default: True)
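The relationship between variance, the squared error coefficient, and normalize_log_prob can be checked with a small plain-Python sketch. normal_nll_loss and squared_error_loss are hypothetical helpers that follow the loss forms quoted above; they are not the library's implementation.

```python
import math


def normal_nll_loss(mean, targets, variance=0.5, weight=1.0,
                    normalize_log_prob=True):
    """weight * sum of negative log probs / batch_size (hypothetical helper)."""
    batch_size = len(mean)
    total = 0.0
    for mean_row, target_row in zip(mean, targets):
        for m, t in zip(mean_row, target_row):
            total += (t - m) ** 2 / (2.0 * variance)
            if normalize_log_prob:
                total += 0.5 * math.log(2.0 * math.pi * variance)
    return weight * total / batch_size


def squared_error_loss(prediction, targets, coefficient=1.0):
    """coefficient * sum((targets - prediction)**2) / batch_size."""
    batch_size = len(prediction)
    total = sum((t - p) ** 2
                for p_row, t_row in zip(prediction, targets)
                for p, t in zip(p_row, t_row))
    return coefficient * total / batch_size


mean = [[0.1, -0.4], [1.0, 2.0], [0.3, 0.3]]        # shape (batch_size=3, 2)
targets = [[0.0, 0.0], [1.5, 1.0], [-0.2, 0.9]]

# variance=0.5 reproduces the plain squared error, variance=1.0 the 0.5 * form.
plain = normal_nll_loss(mean, targets, variance=0.5, normalize_log_prob=False)
halved = normal_nll_loss(mean, targets, variance=1.0, normalize_log_prob=False)

# The normalization constant only shifts the loss by a parameter-independent term.
shift = normal_nll_loss(mean, targets, variance=0.5) - plain
```

Here shift equals 2 * 0.5 * log(2 * pi * 0.5) = log(pi), one normalization term per output dimension, which does not depend on mean.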

Curvature Blocks

`CurvatureBlock`(layer_tag_eq, name)	Abstract class for curvature approximation blocks.
`ScaledIdentity`(layer_tag_eq, name[, scale])	A block that assumes that the curvature is a scaled identity matrix.
`Diagonal`(layer_tag_eq, name)	An abstract class for approximating only the diagonal of curvature.
`Full`(layer_tag_eq, name[, ...])	An abstract class for approximating the block matrix with a full matrix.
`TwoKroneckerFactored`(layer_tag_eq, name)	A Kronecker factored block for layers with weights and an optional bias.
`NaiveDiagonal`(layer_tag_eq, name)	Approximates the diagonal of the curvature in the most obvious way.
`NaiveFull`(layer_tag_eq, name[, ...])	Approximates the full curvature in the most obvious way.
`DenseDiagonal`(layer_tag_eq, name)	A Diagonal block specifically for dense layers.
`DenseFull`(layer_tag_eq, name[, ...])	A Full block specifically for dense layers.
`DenseTwoKroneckerFactored`(layer_tag_eq, name)	A `TwoKroneckerFactored` block specifically for dense layers.
`Conv2DDiagonal`(layer_tag_eq, name[, ...])	A `Diagonal` block specifically for 2D convolution layers.
`Conv2DFull`(layer_tag_eq, name[, ...])	A `Full` block specifically for 2D convolution layers.
`Conv2DTwoKroneckerFactored`(layer_tag_eq, name)	A `TwoKroneckerFactored` block specifically for 2D convolution layers.
`ScaleAndShiftDiagonal`(layer_tag_eq, name)	A diagonal approximation specifically for scale and shift layers.
`ScaleAndShiftFull`(layer_tag_eq, name[, ...])	A full dense approximation specifically for scale and shift layers.
`set_max_parallel_elements`(value)	Sets the default value of maximum parallel elements in the module.
`get_max_parallel_elements`()	Returns the default value of maximum parallel elements in the module.
`set_default_eigen_decomposition_threshold`(value)	Sets the default value of the eigen decomposition threshold.
`get_default_eigen_decomposition_threshold`()	Returns the default value of the eigen decomposition threshold.

CurvatureBlock

class kfac_jax.CurvatureBlock(layer_tag_eq, name)[source]

Abstract class for curvature approximation blocks.

A CurvatureBlock defines a curvature matrix to be estimated, and gives methods to multiply powers of this with a vector. Powers can be computed exactly or with a class-determined approximation. Cached versions of the powers can be pre-computed to make repeated multiplications cheaper. During initialization, you must explicitly specify all powers that you will need to cache.

class State(cache)[source]

Persistent state of the block.

Any subclasses of CurvatureBlock should also internally extend this class, with any attributes needed for the curvature estimation.

cache

A dictionary containing any state data that is updated at irregular intervals, such as inverses, eigenvalues, etc. Elements of this are updated via calls to update_cache(), and do not necessarily correspond to the most up-to-date curvature estimate.

Type: Optional[Dict[str, Union[Array, Dict[str, Array]]]]

__init__(cache)

__init__(layer_tag_eq, name)[source]

Initializes the block.

Parameters

layer_tag_eq (tags.LayerTagEqn) – The Jax equation corresponding to the layer tag whose curvature this block will approximate.
name (str) – The name of this block.

property layer_tag_primitive: tags.LayerTag

The jax.core.Primitive corresponding to the block’s tag equation.

Return type: tags.LayerTag

property parameter_variables: Tuple[jax.core.Var, ...]

The parameter variables of the underlying Jax equation.

Return type: Tuple[jax.core.Var, …]

property outputs_shapes: Tuple[Shape, ...]

The shapes of the output variables of the block’s tag equation.

Return type: Tuple[Shape, …]

property inputs_shapes: Tuple[Shape, ...]

The shapes of the input variables of the block’s tag equation.

Return type: Tuple[Shape, …]

property parameters_shapes: Tuple[Shape, ...]

The shapes of the parameter variables of the block’s tag equation.

Return type: Tuple[Shape, …]

property parameters_canonical_order: Tuple[int, ...]

The canonical order of the parameter variables.

Return type: Tuple[int, …]

property layer_tag_extra_params: Dict[str, Any]

Any extra parameters passed into the Jax primitive of this block.

Return type: Dict[str, Any]

property number_of_parameters: int

Number of parameter variables of this block.

Return type: int

property dim: int

The number of elements of all parameter variables together.

Return type: int

scale(state, use_cache)[source]

A scalar pre-factor of the curvature approximation.

Importantly, all methods assume that whenever a user requests cached values, any state-dependent scale is taken into account by the cache (e.g. either stored explicitly and used or mathematically added to values).

Parameters

state (CurvatureBlock.State) – The state for this block.
use_cache (bool) – Whether the method requesting this is using cached values or not.

Return type

Numeric

Returns

A scalar value to be multiplied with any unscaled block representation.

fixed_scale()[source]

A fixed scalar pre-factor of the curvature (e.g. constant).

Return type: Numeric

state_dependent_scale(state)[source]

A scalar pre-factor of the curvature, computed from the most recent curvature estimate.

Return type: Numeric

init(rng, exact_powers_to_cache, approx_powers_to_cache, cache_eigenvalues)[source]

Initializes the state for this block.

Parameters

rng (PRNGKey) – The PRNGKey to be used for any randomness in the initialization.
exact_powers_to_cache (Optional[ScalarOrSequence]) – A single value, or multiple values in a list, which specify which exact matrix powers the block should be caching. Matrix powers, which are expected to be used in multiply_matpower(), multiply_inverse() or multiply() with exact_power=True and use_cached=True must be provided here.
approx_powers_to_cache (Optional[ScalarOrSequence]) – A single value, or multiple values in a list, which specify which approximate matrix powers the block should be caching. Matrix powers, which are expected to be used in multiply_matpower(), multiply_inverse() or multiply() with exact_power=False and use_cached=True must be provided here.
cache_eigenvalues (bool) – Specifies whether the block should be caching the eigenvalues of its approximate curvature.

Return type

Returns

A dictionary with the initialized state.

abstract sync(state, pmap_axis_name)[source]

Syncs the state across different devices (does not sync the cache).

Return type: CurvatureBlock.State

multiply_matpower(state, vector, identity_weight, power, exact_power, use_cached)[source]

Computes (BlockMatrix + identity_weight I)**power times vector.

Parameters

state (CurvatureBlock.State) – The state for this block.
vector (Sequence[Array]) – A tuple of arrays that should have the same shapes as the block’s parameters_shapes, which represent the vector you want to multiply.
identity_weight (Numeric) – A scalar specifying the weight on the identity matrix that is added to the block matrix before raising it to a power. If use_cached=False it is guaranteed that this argument will be used in the computation. When returning cached values, this argument may be ignored in favor of whatever value was last passed to update_cache(). The precise semantics of this depend on the concrete subclass and its particular behavior in regard to caching.
power (Scalar) – The power to which to raise the matrix.
exact_power (bool) – Specifies whether to compute the exact matrix power of BlockMatrix + identity_weight I. When this argument is False the exact behaviour will depend on the concrete subclass and the result will in general be an approximation to (BlockMatrix + identity_weight I)^power, although some subclasses may still compute the exact matrix power.
use_cached (bool) – Whether to use a cached version for computing the product or to use the most recent curvature estimates. The cached version is going to be at least as fresh as the value provided to the last call to update_cache() with the same value of power.

Return type

Tuple[Array, …]

Returns

A tuple of arrays, representing the result of the matrix-vector product.
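For a block whose matrix is diagonal, the product (BlockMatrix + identity_weight I)**power times vector reduces to an element-wise operation. The sketch below illustrates that special case in plain Python; diagonal_multiply_matpower is a hypothetical function, and it ignores caching and the block scale entirely.

```python
def diagonal_multiply_matpower(diagonal, vector, identity_weight, power):
    """Computes (diag(diagonal) + identity_weight * I)**power times vector."""
    return [(d + identity_weight) ** power * v
            for d, v in zip(diagonal, vector)]


diagonal = [4.0, 1.0, 0.25]      # hypothetical curvature diagonal
vector = [1.0, 2.0, 3.0]
damping = 0.5                    # plays the role of identity_weight

product = diagonal_multiply_matpower(diagonal, vector, damping, power=1)
recovered = diagonal_multiply_matpower(diagonal, product, damping, power=-1)
half = diagonal_multiply_matpower(diagonal, vector, damping, power=0.5)
```

Applying power=-1 to product recovers vector, and applying power=0.5 twice gives the same result as power=1, which mirrors how multiply() and multiply_inverse() relate to multiply_matpower().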

multiply(state, vector, identity_weight, exact_power, use_cached)[source]

Computes (BlockMatrix + identity_weight I) times vector.

Return type: Tuple[Array, …]

multiply_inverse(state, vector, identity_weight, exact_power, use_cached)[source]

Computes (BlockMatrix + identity_weight I)^-1 times vector.

Return type: Tuple[Array, …]

eigenvalues(state, use_cached)[source]

Computes the eigenvalues for this block approximation.

Parameters

state (CurvatureBlock.State) – The state dict for this block.
use_cached (bool) – Whether to use a cached version of the eigenvalues or to use the most recent curvature estimates to compute them. The cached version is going to be at least as fresh as the last time you called update_cache() with eigenvalues=True.

Return type

Array

Returns

An array containing the eigenvalues of the block.

abstract update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.
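The role of ema_old and ema_new can be illustrated with a simple weighted update. This is a sketch that assumes the update has the form ema_old * old + ema_new * new; the actual utils.WeightedMovingAverage may track additional quantities, and ema_update is a hypothetical name.

```python
def ema_update(old_estimate, new_value, ema_old, ema_new):
    """Element-wise ema_old * old_estimate + ema_new * new_value."""
    return [ema_old * o + ema_new * n for o, n in zip(old_estimate, new_value)]


estimate = [1.0, 2.0]            # hypothetical running curvature statistic
batch_statistic = [3.0, 0.0]     # statistic computed from the current batch

averaged = ema_update(estimate, batch_statistic, ema_old=0.9, ema_new=0.1)
replaced = ema_update(estimate, batch_statistic, ema_old=0.0, ema_new=1.0)
```

With ema_old=0 and ema_new=1 the previous estimate is discarded and replaced equals batch_statistic.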

Parameters

state (CurvatureBlock.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

update_cache(state, identity_weight, exact_powers, approx_powers, eigenvalues)[source]

Updates the cached estimates of the different powers specified.

Parameters

state (CurvatureBlock.State) – The state dict for this block to update.
identity_weight (Numeric) – The weight of the identity added to the block’s curvature matrix before computing the cached matrix power.
exact_powers (Optional[ScalarOrSequence]) – Specifies any cached exact matrix powers to be updated.
approx_powers (Optional[ScalarOrSequence]) – Specifies any cached approximate matrix powers to be updated.
eigenvalues (bool) – Specifies whether to update the cached eigenvalues of the block. If they have not been cached before, this will create an entry with them in the block’s cache.

Return type

Returns

The updated state.

to_dense_matrix(state)[source]

Returns a dense representation of the approximate curvature matrix.

Return type: Array

norm(state, norm_type)[source]

Computes the norm of the curvature block, according to norm_type.

Return type: Numeric

ScaledIdentity

class kfac_jax.ScaledIdentity(layer_tag_eq, name, scale=1.0)[source]

A block that assumes that the curvature is a scaled identity matrix.

__init__(layer_tag_eq, name, scale=1.0)[source]

Initializes the block.

Parameters

layer_tag_eq (tags.LayerTagEqn) – The Jax equation corresponding to the layer tag whose curvature this block will approximate.
name (str) – The name of this block.
scale (Numeric) – The scale of the identity matrix.

fixed_scale()[source]

A fixed scalar pre-factor of the curvature (e.g. constant).

Return type: Numeric

sync(state, pmap_axis_name)[source]

Syncs the state across different devices (does not sync the cache).

Return type: CurvatureBlock.State

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (CurvatureBlock.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.
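
The role of ema_old and ema_new can be illustrated with a small NumPy sketch. It assumes the new estimate is the plain weighted combination ema_old * old + ema_new * new and ignores any weight normalization the library may apply; ema_update is a hypothetical helper, not part of kfac_jax.

```python
import numpy as np

def ema_update(old_estimate, new_value, ema_old, ema_new):
    """Hypothetical helper: weighted combination of the old and new estimates."""
    return ema_old * old_estimate + ema_new * new_value

old = np.array([1.0, 2.0])
new = np.array([3.0, 6.0])

# A typical moving average keeps most of the old estimate.
averaged = ema_update(old, new, ema_old=0.9, ema_new=0.1)  # approximately [1.2, 2.4]

# ema_old=0 and ema_new=1 discard the history and keep only the new value.
replaced = ema_update(old, new, ema_old=0.0, ema_new=1.0)
```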

Return type

Diagonal

class kfac_jax.Diagonal(layer_tag_eq, name)[source]

An abstract class for approximating only the diagonal of curvature.

class State(cache, diagonal_factors)[source]

Persistent state of the block.

diagonal_factors

A tuple of the moving averages of the estimated diagonals of the curvature for each parameter that is part of the associated layer.

Type: Tuple[utils.WeightedMovingAverage]

__init__(cache, diagonal_factors)

sync(state, pmap_axis_name)[source]

Syncs the state across different devices (does not sync the cache).

Return type: Diagonal.State

Full

class kfac_jax.Full(layer_tag_eq, name, eigen_decomposition_threshold=None)[source]

An abstract class for approximating the block matrix with a full matrix.

class State(cache, matrix)[source]

Persistent state of the block.

matrix

A moving average of the estimated curvature matrix for all parameters that are part of the associated layer.

Type: utils.WeightedMovingAverage

__init__(cache, matrix)

__init__(layer_tag_eq, name, eigen_decomposition_threshold=None)[source]

Initializes the block.

Parameters

layer_tag_eq (tags.LayerTagEqn) – The JAX equation corresponding to the layer tag for which this block will approximate the curvature.
name (str) – The name of this block.
eigen_decomposition_threshold (Optional[int]) – During calls to init and update_cache, if more matrix powers than this threshold are requested, the block computes the eigen-decomposition directly (which provides access to any matrix power) instead of computing individual approximate powers. If this is None, the value returned from get_default_eigen_decomposition_threshold() is used.
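
To see why a single eigen-decomposition gives access to any matrix power, consider the following NumPy sketch. It is an illustration under the assumption of a symmetric curvature matrix, not the library's code; matpower_from_eigh is a hypothetical helper.

```python
import numpy as np

def matpower_from_eigh(eigenvalues, eigenvectors, identity_weight, power):
    """Hypothetical helper: (M + identity_weight * I) ** power from M = Q diag(e) Q^T."""
    scaled = (eigenvalues + identity_weight) ** power
    # Scale the columns of Q, then project back: Q diag(scaled) Q^T.
    return (eigenvectors * scaled) @ eigenvectors.T

matrix = np.array([[2.0, 1.0], [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # computed once

# Any number of powers can now be formed without further decompositions.
inverse = matpower_from_eigh(eigenvalues, eigenvectors, 0.5, -1.0)
inv_sqrt = matpower_from_eigh(eigenvalues, eigenvectors, 0.5, -0.5)
```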

parameters_list_to_single_vector(parameters_shaped_list)[source]

Converts values corresponding to parameters of the block to vector.

Return type: Array

single_vector_to_parameters_list(vector)[source]

Reverses the transformation self.parameters_list_to_single_vector.

Return type: Tuple[Array, …]
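
The pair of conversions above can be pictured as flattening and un-flattening. The sketch below uses NumPy and hypothetical helpers to_single_vector and to_parameters_list; unlike the real methods, the inverse here takes the shapes explicitly, since this standalone function has no block to read them from.

```python
import numpy as np

def to_single_vector(parameters_shaped_list):
    """Hypothetical helper: flatten each parameter and concatenate."""
    return np.concatenate([p.reshape(-1) for p in parameters_shaped_list])

def to_parameters_list(vector, shapes):
    """Hypothetical helper: split the vector and restore each parameter's shape."""
    sizes = [int(np.prod(s)) for s in shapes]
    split_points = np.cumsum(sizes)[:-1]
    chunks = np.split(vector, split_points)
    return tuple(c.reshape(s) for c, s in zip(chunks, shapes))

weights = np.arange(6.0).reshape(2, 3)
bias = np.array([10.0, 20.0, 30.0])

flat = to_single_vector([weights, bias])              # shape (9,)
restored = to_parameters_list(flat, [(2, 3), (3,)])   # round trip
```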

sync(state, pmap_axis_name)[source]

Syncs the state across different devices (does not sync the cache).

Return type: Full.State

TwoKroneckerFactored

class kfac_jax.TwoKroneckerFactored(layer_tag_eq, name)[source]

A Kronecker factored block for layers with weights and an optional bias.
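
The practical benefit of a two-factor Kronecker approximation is that the large matrix never has to be formed. The NumPy sketch below checks this on a toy example; the names inputs_factor and outputs_factor, the (in, out) weight layout, and the row-major flattening are assumptions made for illustration and need not match the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    """Random symmetric positive-definite matrix for the demonstration."""
    m = rng.normal(size=(n, n))
    return m @ m.T + n * np.eye(n)

inputs_factor = random_spd(3)     # (in, in)
outputs_factor = random_spd(2)    # (out, out)
vector = rng.normal(size=(3, 2))  # same shape as the weights

# Dense route: build the 6x6 Kronecker product and solve against it.
dense = np.kron(inputs_factor, outputs_factor)
dense_result = np.linalg.solve(dense, vector.reshape(-1)).reshape(3, 2)

# Factored route: two small solves, valid because both factors are symmetric.
factored_result = np.linalg.solve(inputs_factor, vector) @ np.linalg.inv(outputs_factor)
```

Both routes give the same result, but the factored one only ever handles the 3x3 and 2x2 factors.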

__init__(layer_tag_eq, name)[source]

Initializes the block.

Parameters

layer_tag_eq (tags.LayerTagEqn) – The JAX equation corresponding to the layer tag for which this block will approximate the curvature.
name (str) – The name of this block.

property has_bias: bool

Whether this layer’s equation has a bias.

Return type: bool

parameters_shaped_list_to_array(parameters_shaped_list)[source]

Combines all parameters into a single non-axis-grouped array.

Return type: Array

array_to_parameters_shaped_list(array)[source]

An inverse transformation of self.parameters_shaped_list_to_array.

Return type: Tuple[Array, …]

NaiveDiagonal

class kfac_jax.NaiveDiagonal(layer_tag_eq, name)[source]

Approximates the diagonal of the curvature in the most obvious way.

The update to the curvature estimate is computed by (sum_i g_i) ** 2 / N, where g_i is the gradient of each individual data point, and N is the batch size.
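
As a worked instance of the formula (sum_i g_i) ** 2 / N, the following NumPy sketch uses made-up per-example gradients. In practice the summed gradient would typically come from a single backward pass; the explicit per-example array here is only for illustration.

```python
import numpy as np

# Hypothetical per-example gradients: N = 4 data points, 3 parameters.
per_example_grads = np.array([
    [1.0, 0.0, -1.0],
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
])
batch_size = per_example_grads.shape[0]

summed = per_example_grads.sum(axis=0)     # sum_i g_i = [4, 4, 2]
naive_diagonal = summed ** 2 / batch_size  # [4, 4, 1]
```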

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (NaiveDiagonal.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

NaiveDiagonal.State

NaiveFull

class kfac_jax.NaiveFull(layer_tag_eq, name, eigen_decomposition_threshold=None)[source]

Approximates the full curvature in the most obvious way.

The update to the curvature estimate is computed by (sum_i g_i) (sum_i g_i)^T / N, where g_i is the gradient of each individual data point, and N is the batch size.
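
As a worked instance of the formula (sum_i g_i) (sum_i g_i)^T / N, the NumPy sketch below uses made-up per-example gradients; the array is purely illustrative.

```python
import numpy as np

# Hypothetical per-example gradients: N = 2 data points, 2 parameters.
per_example_grads = np.array([
    [1.0, 0.0],
    [3.0, 2.0],
])
batch_size = per_example_grads.shape[0]

summed = per_example_grads.sum(axis=0)              # sum_i g_i = [4, 2]
naive_full = np.outer(summed, summed) / batch_size  # [[8, 4], [4, 2]]
```

The result is a rank-one matrix whose diagonal, [8, 2], equals (sum_i g_i) ** 2 / N.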

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Full.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

DenseDiagonal

class kfac_jax.DenseDiagonal(layer_tag_eq, name)[source]

A Diagonal block specifically for dense layers.

property has_bias: bool

Whether the layer has a bias parameter.

Return type: bool

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Diagonal.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

Diagonal.State
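
For a dense layer, a diagonal estimate can be formed without materializing per-example gradients, because each per-example weight gradient is the outer product of the layer input and the output tangent. The NumPy sketch below demonstrates this identity on random data; it is an assumption-laden illustration (bias omitted, (batch, features) layout), not a description of how DenseDiagonal is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))           # (batch, in)
output_tangents = rng.normal(size=(5, 2))  # (batch, out)
batch_size = inputs.shape[0]

# Explicit route: per-example weight gradients, squared and averaged.
per_example_grads = np.einsum("ni,no->nio", inputs, output_tangents)
explicit = np.mean(per_example_grads ** 2, axis=0)

# Factored route: square first, then one matrix product.
factored = (inputs ** 2).T @ (output_tangents ** 2) / batch_size
```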

DenseFull

class kfac_jax.DenseFull(layer_tag_eq, name, eigen_decomposition_threshold=None)[source]

A Full block specifically for dense layers.

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Full.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

DenseTwoKroneckerFactored

class kfac_jax.DenseTwoKroneckerFactored(layer_tag_eq, name)[source]

A TwoKroneckerFactored block specifically for dense layers.

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (KroneckerFactored.State) – The state dict for this block to update.
estimation_data (Mapping[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

KroneckerFactored.State

Conv2DDiagonal

class kfac_jax.Conv2DDiagonal(layer_tag_eq, name, max_elements_for_vmap=None)[source]

A Diagonal block specifically for 2D convolution layers.

__init__(layer_tag_eq, name, max_elements_for_vmap=None)[source]

Initializes the block.

Since there is no ‘nice’ formula for computing the average of the tangents for a 2D convolution, the block uses a function, self.conv2d_tangent_squared, that computes the square of the tangents for the kernel of the convolution for a single feature map. To average over the batch there are two choices: vmap, or a sequential loop over the batch using scan. This block provides a trade-off by letting you specify the maximum batch size to vmap over. The maximum memory usage is then max_batch_size_for_vmap times the memory needed for one call to self.conv2d_tangent_squared, and the vmap is called ceil(total_batch_size / max_batch_size_for_vmap) times in a loop to find the final average.

Parameters

layer_tag_eq (tags.LayerTagEqn) – The JAX equation corresponding to the layer tag for which this block will approximate the curvature.
name (str) – The name of this block.
max_elements_for_vmap (Optional[int]) – The threshold used for determining how much computation to do in parallel and how much serially. If None, the value returned by get_max_parallel_elements() is used.
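
The memory and time trade-off between vmap and scan can be sketched in plain NumPy. Here the vectorized operation on each chunk stands in for vmap and the Python loop stands in for scan; chunked_mean_of_squares is a hypothetical helper, and squaring a precomputed array stands in for the per-feature-map tangent computation.

```python
import math
import numpy as np

def chunked_mean_of_squares(per_example_tangents, max_batch_size_for_vmap):
    """Hypothetical helper: batch average of squares, processed in chunks."""
    total_batch_size = per_example_tangents.shape[0]
    num_chunks = math.ceil(total_batch_size / max_batch_size_for_vmap)
    accumulator = np.zeros(per_example_tangents.shape[1:])
    for i in range(num_chunks):  # sequential part (scan in JAX)
        start = i * max_batch_size_for_vmap
        chunk = per_example_tangents[start:start + max_batch_size_for_vmap]
        accumulator += np.sum(chunk ** 2, axis=0)  # parallel part (vmap in JAX)
    return accumulator / total_batch_size, num_chunks

tangents = np.arange(10.0).reshape(5, 2)  # batch of 5
average, num_chunks = chunked_mean_of_squares(tangents, max_batch_size_for_vmap=2)
# num_chunks is ceil(5 / 2) = 3; at most 2 examples are held in memory at once.
```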

conv2d_tangent_squared(image_features_map, output_tangent)[source]

Computes the elementwise square of a tangent for a single feature map.

Return type: Array

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Diagonal.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

Diagonal.State

Conv2DFull

class kfac_jax.Conv2DFull(layer_tag_eq, name, max_elements_for_vmap=None)[source]

A Full block specifically for 2D convolution layers.

__init__(layer_tag_eq, name, max_elements_for_vmap=None)[source]

Initializes the block.

Since there is no ‘nice’ formula for computing the average of the tangents for a 2D convolution, the block uses a function, self.conv2d_tangent_squared, that computes the square of the tangents for the kernel of the convolution for a single feature map. To average over the batch there are two choices: vmap, or a sequential loop over the batch using scan. This block provides a trade-off by letting you specify the maximum batch size that will be handled in a single iteration of the loop. The maximum memory usage is then max_batch_size_for_vmap times the memory needed for one call to self.conv2d_tangent_squared, and the vmap is called ceil(total_batch_size / max_batch_size_for_vmap) times in a loop to find the final average.

Parameters

layer_tag_eq (tags.LayerTagEqn) – The JAX equation corresponding to the layer tag for which this block will approximate the curvature.
name (str) – The name of this block.
max_elements_for_vmap (Optional[int]) – The threshold used for determining how much computation to do in parallel and how much serially. If None, the value returned by get_max_parallel_elements() is used.

conv2d_tangent_outer_product(inputs, tangent_of_outputs)[source]

Computes the outer product of a tangent for a single feature map.

Return type: Array

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Full.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

Conv2DTwoKroneckerFactored

class kfac_jax.Conv2DTwoKroneckerFactored(layer_tag_eq, name)[source]

A TwoKroneckerFactored block specifically for 2D convolution layers.

fixed_scale()[source]

A fixed scalar pre-factor of the curvature (e.g. constant).

Return type: Numeric

property outputs_channel_index: int

The channels index in the outputs of the layer.

Return type: int

property inputs_channel_index: int

The channels index in the inputs of the layer.

Return type: int

property weights_output_channel_index: int

The channels index in weights of the layer.

Return type: int

property weights_spatial_size: int

The spatial filter size of the weights.

Return type: int

property num_locations: int

The number of spatial locations that each filter is applied to.

Return type: int

property num_inputs_channels: int

The number of channels in the inputs to the layer.

Return type: int

property num_outputs_channels: int

The number of channels in the outputs of the layer.

Return type: int

compute_inputs_stats(inputs, weighting_array=None)[source]

Computes the statistics for the inputs factor.

Return type: Array

compute_outputs_stats(tangent_of_output)[source]

Computes the statistics for the outputs factor.

Return type: Array

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (TwoKroneckerFactored.State) – The state dict for this block to update.
estimation_data (Mapping[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

TwoKroneckerFactored.State

ScaleAndShiftDiagonal

class kfac_jax.ScaleAndShiftDiagonal(layer_tag_eq, name)[source]

A diagonal approximation specifically for scale and shift layers.

property has_scale: bool

Whether this layer’s equation has a scale.

Return type: bool

property has_shift: bool

Whether this layer’s equation has a shift.

Return type: bool

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Diagonal.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type

Diagonal.State

ScaleAndShiftFull

class kfac_jax.ScaleAndShiftFull(layer_tag_eq, name, eigen_decomposition_threshold=None)[source]

A full dense approximation specifically for scale and shift layers.

update_curvature_matrix_estimate(state, estimation_data, ema_old, ema_new, batch_size)[source]

Updates the block’s curvature estimates using the info provided.

Each block in general estimates a moving average of its associated curvature matrix. If you don’t want a moving average you can set ema_old=0 and ema_new=1.

Parameters

state (Full.State) – The state dict for this block to update.
estimation_data (Dict[str, Sequence[Array]]) – A map containing data used for updating the curvature matrix estimate for this block. This can be computed by calling the function returned from layer_tags_vjp(). Please see its implementation for more details on the name of the fields and how they are constructed.
ema_old (Numeric) – Specifies the weight of the old value when computing the updated estimate in the moving average.
ema_new (Numeric) – Specifies the weight of the new value when computing the updated estimate in the moving average.
batch_size (Numeric) – The batch size used in computing the values in estimation_data.

Return type