# Optimization

The `.optimization` module provides:

- an optimizer with fixed weight decay that can be used to fine-tune models,
- several schedules in the form of schedule objects that inherit from `_LRSchedule`, and
- a gradient accumulation class to accumulate the gradients of multiple batches.

## AdamW (PyTorch)

`(params: typing.Iterable[torch.nn.parameter.Parameter], lr: float = 0.001, betas: typing.Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, correct_bias: bool = True)`

Implements Adam algorithm with weight decay fix as introduced in Decoupled Weight Decay Regularization.

( closure: typing.Callable = None )

Performs a single optimization step.
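To make the "weight decay fix" concrete, here is a minimal sketch of one update for a single scalar parameter in plain Python. It is an illustration under stated assumptions, not the library's code: the function name `adamw_step` and the exact order in which decay is applied are hypothetical. The point it shows is that the decay term is applied directly to the parameter and never passes through the moment estimates `m` and `v`.

```python
import math


def adamw_step(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-6, weight_decay=0.0, correct_bias=True):
    """One Adam update with decoupled weight decay for a scalar parameter.

    `step` is the 1-based update count. Returns the new (p, m, v).
    """
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    step_size = lr
    if correct_bias:
        # Compensate for m and v being initialised at zero.
        step_size *= math.sqrt(1.0 - beta2 ** step) / (1.0 - beta1 ** step)

    p = p - step_size * m / (math.sqrt(v) + eps)
    # Decoupled decay: shrink the weight directly, outside of m and v.
    p = p - lr * weight_decay * p
    return p, m, v


p, m, v = adamw_step(1.0, grad=1.0, m=0.0, v=0.0, step=1, weight_decay=0.1)
print(p)
```

Because the decay is proportional to the weight itself, its strength does not depend on the gradient history.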

## AdaFactor (PyTorch)

`(params, lr = None, eps = (1e-30, 0.001), clip_threshold = 1.0, decay_rate = -0.8, beta1 = None, weight_decay = 0.0, scale_parameter = True, relative_step = True, warmup_init = False)`

PyTorch implementation of Adafactor that can be used as a drop-in replacement for Adam. Original fairseq code: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py

Paper: *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost* (https://arxiv.org/abs/1804.04235).

Note that this optimizer internally adjusts the learning rate depending on the *scale_parameter*, *relative_step* and
*warmup_init* options. To use a manual (external) learning rate schedule, set *scale_parameter=False* and
*relative_step=False*.
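How the three options interact can be sketched in a few lines of plain Python. This is a hedged illustration, not the optimizer's code: the helper name `adafactor_lr` is hypothetical, and the caps `1e-2` and `1e-6 * step` are assumptions drawn from the fairseq-derived implementation that are worth checking against the source.

```python
import math


def adafactor_lr(step, param_rms, lr=None, eps2=1e-3,
                 scale_parameter=True, relative_step=True, warmup_init=False):
    """Effective learning rate for one parameter at a 1-based `step`.

    `param_rms` is the root-mean-square of the parameter's current values;
    `eps2` plays the role of the second entry of the `eps` tuple.
    """
    if relative_step:
        # Time-dependent step size: decays like 1/sqrt(step), with a cap
        # that optionally ramps up from zero when warmup_init is set.
        cap = 1e-6 * step if warmup_init else 1e-2
        step_size = min(cap, 1.0 / math.sqrt(step))
    else:
        if lr is None:
            raise ValueError("an explicit lr is needed when relative_step=False")
        step_size = lr

    # Optionally scale the step by the magnitude of the parameter itself.
    scale = max(eps2, param_rms) if scale_parameter else 1.0
    return scale * step_size


# Internal schedule: the optimizer picks the rate on its own.
print(adafactor_lr(step=1, param_rms=0.5))
# External schedule: both adjustments are switched off, so lr is used as given.
print(adafactor_lr(step=100, param_rms=0.5, lr=1e-3,
                   scale_parameter=False, relative_step=False))
```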

This implementation handles low-precision (FP16, bfloat) values, but it has not been thoroughly tested.

Recommended T5 finetuning settings (https://discuss.huggingface.co/t/t5-finetuning-tips/684/3):

- Training without LR warmup or clip_threshold is not recommended:
  - use scheduled LR warm-up to fixed LR
  - use clip_threshold=1.0 (https://arxiv.org/abs/1804.04235)
- Disable relative updates.
- Use scale_parameter=False.
- Additional optimizer operations like gradient clipping should not be used alongside Adafactor.

Example:

`Adafactor(model.parameters(), scale_parameter=False, relative_step=False, warmup_init=False, lr=1e-3)`

Others reported the following combination to work well:

`Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)`

When using `lr=None` with Trainer, you will most likely need to use the `AdafactorSchedule` scheduler as follows:

```
from transformers.optimization import Adafactor, AdafactorSchedule
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)
trainer = Trainer(..., optimizers=(optimizer, lr_scheduler))
```

Usage:

```
from transformers.optimization import Adafactor

# replace AdamW with Adafactor
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)
```

( closure = None )

Performs a single optimization step.

## AdamWeightDecay (TensorFlow)

`(learning_rate: typing.Union[float, keras.optimizer_v2.learning_rate_schedule.LearningRateSchedule] = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, epsilon: float = 1e-07, amsgrad: bool = False, weight_decay_rate: float = 0.0, include_in_weight_decay: typing.Optional[typing.List[str]] = None, exclude_from_weight_decay: typing.Optional[typing.List[str]] = None, name: str = 'AdamWeightDecay', **kwargs)`

Adam enables L2 weight decay and clip_by_global_norm on gradients. Just adding the square of the weights to the
loss function is *not* the correct way of using L2 regularization/weight decay with Adam, since that will interact
with the m and v parameters in strange ways, as shown in *Decoupled Weight Decay Regularization*.

Instead, we want to decay the weights in a manner that doesn't interact with the m/v parameters. This is equivalent to adding the square of the weights to the loss with plain (non-momentum) SGD.
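The equivalence for plain SGD, and its failure for Adam, can be checked with a small numeric sketch. All names here are hypothetical, and two simplifying assumptions are made: the penalty is written as `0.5 * wd * w**2` so its gradient is `wd * w`, and Adam is reduced to its very first bias-corrected step with a negligible epsilon, where the update is just `lr * sign(gradient)`.

```python
def sgd_l2_step(w, grad, lr, wd):
    # Penalty 0.5 * wd * w**2 in the loss adds wd * w to the gradient.
    return w - lr * (grad + wd * w)


def sgd_decoupled_step(w, grad, lr, wd):
    # Gradient step and weight shrinkage applied separately.
    return w - lr * grad - lr * wd * w


def adam_first_step(w, grad, lr):
    # On step 1 the bias-corrected moments are grad and grad**2,
    # so the normalised update has magnitude lr (epsilon ignored).
    return w - lr * grad / abs(grad)


w, grad, lr, wd = 2.0, 1.0, 0.1, 0.5

# Plain SGD: both formulations give the same weight.
print(sgd_l2_step(w, grad, lr, wd), sgd_decoupled_step(w, grad, lr, wd))

# Adam with the penalty in the loss: the decay term is normalised away.
adam_l2 = adam_first_step(w, grad + wd * w, lr)
# Adam with decoupled decay: the shrinkage stays proportional to w.
adam_decoupled = adam_first_step(w, grad, lr) - lr * wd * w
print(adam_l2, adam_decoupled)
```

With the penalty folded into the gradient, Adam's normalisation changes how strongly each weight is decayed, which is the interaction the decoupled form avoids.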

( config )

Creates an optimizer from its config with WarmUp custom object.

`(init_lr: float, num_train_steps: int, num_warmup_steps: int, min_lr_ratio: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, weight_decay_rate: float = 0.0, power: float = 1.0, include_in_weight_decay: typing.Optional[typing.List[str]] = None)`

Creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.

## Schedules

### Learning Rate Schedules (PyTorch)

`(value, names = None, module = None, qualname = None, type = None, start = 1)`

An enumeration.

`(name: typing.Union[str, transformers.trainer_utils.SchedulerType], optimizer: Optimizer, num_warmup_steps: typing.Optional[int] = None, num_training_steps: typing.Optional[int] = None)`

Unified API to get any scheduler from its name.
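The idea of selecting a schedule by name can be sketched as an enumeration plus a dispatch function. Everything in this block is a hypothetical stand-in written in plain Python: the names `ScheduleName` and `get_factor_fn`, the three schedules chosen, and the fact that it returns a step-to-multiplier function rather than a scheduler object are all assumptions made for illustration.

```python
from enum import Enum


class ScheduleName(Enum):
    # Hypothetical stand-in for an enumeration of scheduler names.
    CONSTANT = "constant"
    CONSTANT_WITH_WARMUP = "constant_with_warmup"
    LINEAR = "linear"


def get_factor_fn(name, num_warmup_steps=None, num_training_steps=None):
    """Return a function mapping a step to a learning-rate multiplier."""
    name = ScheduleName(name)  # accepts a string value or an enum member

    if name is ScheduleName.CONSTANT:
        return lambda step: 1.0

    # Every other schedule needs a warmup length.
    if num_warmup_steps is None:
        raise ValueError(f"{name.value} requires num_warmup_steps")

    def warmup(step):
        return step / max(1, num_warmup_steps)

    if name is ScheduleName.CONSTANT_WITH_WARMUP:
        return lambda step: warmup(step) if step < num_warmup_steps else 1.0

    # Decaying schedules also need the total number of training steps.
    if num_training_steps is None:
        raise ValueError(f"{name.value} requires num_training_steps")

    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return lambda step: (
        warmup(step) if step < num_warmup_steps
        else max(0.0, (num_training_steps - step) / decay_steps)
    )


factor = get_factor_fn("linear", num_warmup_steps=10, num_training_steps=110)
print(factor(5), factor(60))
```

Validating the optional arguments inside the dispatcher is what lets a single entry point serve schedules with different requirements.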

`(optimizer: Optimizer, last_epoch: int = -1)`

Create a schedule with a constant learning rate, using the learning rate set in optimizer.

`(optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1)`

Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer.
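The shape of this schedule can be written as a multiplier applied to the optimizer's initial learning rate. The sketch below is plain Python with a hypothetical function name, intended only to show the curve described above.

```python
def constant_with_warmup_factor(step, num_warmup_steps):
    """Multiplier on the initial lr: linear ramp from 0 to 1, then flat."""
    if step < num_warmup_steps:
        return step / max(1.0, num_warmup_steps)
    return 1.0


initial_lr = 5e-5
for step in (0, 5, 10, 50):
    print(step, initial_lr * constant_with_warmup_factor(step, num_warmup_steps=10))
```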

`(optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1)`

Create a schedule with a learning rate that decreases following the values of the cosine function from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
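As a rough sketch of the curve, the multiplier on the initial learning rate can be computed as below. The function name is hypothetical and the code is plain Python; with the default `num_cycles=0.5` the cosine covers exactly half a period, so the multiplier falls from 1 to 0 over the decay phase.

```python
import math


def cosine_factor(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Multiplier on the initial lr: linear warmup, then cosine decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # Fraction of the decay phase completed so far, from 0 to 1.
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


for step in (0, 10, 60, 110):
    print(step, cosine_factor(step, num_warmup_steps=10, num_training_steps=110))
```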

`(optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1)`

Create a schedule with a learning rate that decreases following the values of the cosine function from the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
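A "hard restart" means the multiplier jumps back to 1 at the start of each cycle instead of rising smoothly. A minimal plain-Python sketch, with a hypothetical function name, uses a modulo on the progress to get that behaviour:

```python
import math


def cosine_hard_restarts_factor(step, num_warmup_steps, num_training_steps, num_cycles=1):
    """Multiplier on the initial lr: warmup, then `num_cycles` cosine decays."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cycle, wrapping back to 0 at each restart.
    cycle_position = (num_cycles * progress) % 1.0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


# Two cycles over 100 steps: the rate decays to ~0 by step 49, restarts at 50.
for step in (0, 25, 49, 50, 75, 100):
    print(step, cosine_hard_restarts_factor(step, 0, 100, num_cycles=2))
```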

`(optimizer, num_warmup_steps, num_training_steps, last_epoch = -1)`

Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
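The triangular shape described here can be sketched as a multiplier on the initial learning rate. This is a plain-Python illustration with a hypothetical function name, not the scheduler object itself.

```python
def linear_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier on the initial lr: linear ramp up, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


for step in (0, 5, 10, 60, 110, 200):
    print(step, linear_factor(step, num_warmup_steps=10, num_training_steps=110))
```

The `max(0.0, ...)` clamp keeps the rate at zero if training runs past `num_training_steps`.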

`(optimizer, num_warmup_steps, num_training_steps, lr_end = 1e-07, power = 1.0, last_epoch = -1)`

Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the
optimizer to end lr defined by *lr_end*, after a warmup period during which it increases linearly from 0 to the
initial lr set in the optimizer.

Note: *power* defaults to 1.0 as in the fairseq implementation, which in turn is based on the original BERT
implementation at
https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37
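The effect of *lr_end* and *power* is easier to see in code. The sketch below is plain Python with a hypothetical function name; it takes the initial learning rate as an explicit `lr_init` argument (an assumption made so the example needs no optimizer) and assumes `num_training_steps > num_warmup_steps`.

```python
def polynomial_factor(step, num_warmup_steps, num_training_steps,
                      lr_init, lr_end=1e-7, power=1.0):
    """Multiplier on lr_init: linear warmup, then polynomial decay to lr_end."""
    if not lr_init > lr_end:
        raise ValueError("lr_end must be smaller than the initial lr")
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    if step > num_training_steps:
        return lr_end / lr_init  # hold at the final rate
    lr_range = lr_init - lr_end
    decay_steps = num_training_steps - num_warmup_steps
    pct_remaining = 1 - (step - num_warmup_steps) / decay_steps
    decayed_lr = lr_range * pct_remaining ** power + lr_end
    return decayed_lr / lr_init  # expressed relative to lr_init


# power=1.0 is a straight line; power=2.0 drops faster early in the decay.
for power in (1.0, 2.0):
    print(power, polynomial_factor(60, 10, 110, lr_init=1e-3, lr_end=1e-5, power=power))
```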

### Warmup (TensorFlow)

`(initial_learning_rate: float, decay_schedule_fn: typing.Callable, warmup_steps: int, power: float = 1.0, name: str = None)`

Applies a warmup schedule on a given learning rate decay schedule.
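Wrapping one schedule in another can be sketched with a closure. The block below is plain Python rather than TensorFlow, and the name `with_warmup` is hypothetical. It assumes that, once warmup ends, the wrapped schedule is evaluated at `step - warmup_steps`, so the decay starts from its own step 0.

```python
def with_warmup(initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0):
    """Return a schedule that ramps up first, then defers to `decay_schedule_fn`."""
    def schedule(step):
        if step < warmup_steps:
            # Polynomial ramp from 0 to initial_learning_rate (linear if power=1).
            return initial_learning_rate * (step / warmup_steps) ** power
        # Shift the step so the wrapped schedule starts from its own step 0.
        return decay_schedule_fn(step - warmup_steps)
    return schedule


def linear_decay(step, lr=1e-3, decay_steps=100):
    # Example wrapped schedule: straight line from lr down to 0.
    return lr * max(0.0, 1.0 - step / decay_steps)


schedule = with_warmup(1e-3, linear_decay, warmup_steps=10)
for step in (0, 5, 10, 60, 110):
    print(step, schedule(step))
```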

## Gradient Strategies

### GradientAccumulator (TensorFlow)

( )

Gradient accumulation utility. When used with a distribution strategy, the accumulator should be called in a
replica context. Gradients will be accumulated locally on each replica and without synchronization. Users should
then call `.gradients`, scale the gradients if required, and pass the result to `apply_gradients`.

( )

Resets the accumulated gradients on the current replica.
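The accumulate, read, scale, and reset cycle can be sketched without TensorFlow. The class below is a hypothetical plain-Python stand-in that stores gradients as a list of floats. The `step` counter is an assumption added so the caller can average over the number of accumulated batches.

```python
class SimpleGradientAccumulator:
    """Sums per-batch gradients until the caller reads and resets them."""

    def __init__(self):
        self._gradients = []
        self._step = 0

    @property
    def step(self):
        # Number of batches accumulated since the last reset.
        return self._step

    @property
    def gradients(self):
        if not self._gradients:
            raise ValueError("call the accumulator at least once first")
        return list(self._gradients)

    def __call__(self, gradients):
        if not self._gradients:
            self._gradients = [0.0] * len(gradients)
        if len(gradients) != len(self._gradients):
            raise ValueError("expected %d gradients" % len(self._gradients))
        for i, grad in enumerate(gradients):
            self._gradients[i] += grad
        self._step += 1

    def reset(self):
        # Zero the sums in place and restart the batch counter.
        self._gradients = [0.0] * len(self._gradients)
        self._step = 0


accumulator = SimpleGradientAccumulator()
accumulator([1.0, 2.0])  # gradients from batch 1
accumulator([3.0, 4.0])  # gradients from batch 2

# Scale before applying: here, average over the accumulated batches.
averaged = [g / accumulator.step for g in accumulator.gradients]
print(averaged)
accumulator.reset()
```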