13. The optimizers
(Source)
In Model.train_step(), which we introduced in previous chapters, optimizer.minimize() is called directly to update the trainable variables to reduce the loss value, using the gradients recorded on the tape. The pseudo-code is shown below.
class Model(Layer):
    def train_step(self, data):
        # Unpack the data into features and targets.
        x, y = data_adapter.unpack(data)
        # Record the forward pass on the tape so gradients can be computed.
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        # Update the trainable variables to reduce the loss.
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
        self.compiled_metrics.update_state(y, y_pred)
        return {metric.name: metric.result() for metric in self.metrics}
To understand how the optimizer works, let's see what happens behind the .minimize() function.
All optimizers in Keras extend the OptimizerV2 class, which extends the Trackable class. Remember that the Layer class also extends the Trackable class: they all have variables to track. Any tf.Variable in their attributes will be tracked automatically by TensorFlow.
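The idea behind this attribute-based tracking can be sketched in plain Python. This is only an illustration of the mechanism, not TensorFlow's actual implementation: a base class intercepts attribute assignment and records any variable it sees.

```python
class Variable:
    """Stand-in for tf.Variable in this sketch: it just holds a value."""
    def __init__(self, value):
        self.value = value

class Trackable:
    """Sketch of attribute-based tracking: any Variable assigned as an
    attribute is recorded automatically."""
    def __setattr__(self, name, attr):
        if isinstance(attr, Variable):
            # Write to __dict__ directly so the registry itself does not
            # go through __setattr__ again.
            self.__dict__.setdefault('_tracked', []).append(attr)
        super().__setattr__(name, attr)

class Layer(Trackable):
    def __init__(self):
        self.kernel = Variable(1.0)  # tracked on assignment
        self.bias = Variable(0.0)    # tracked on assignment

layer = Layer()
print(len(layer._tracked))  # 2
```

Because both layers and optimizers extend the same tracking base class, the variables they own can be saved and restored uniformly.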
OptimizerV2.minimize() calls another method, OptimizerV2.apply_gradients(), to apply the gradients to the variables. In .minimize(), the gradients are either obtained from the gradient tape, which is passed to it as an argument, or computed in the function using loss and var_list, the list of trainable variables passed to it. When calling .apply_gradients(), the gradients and their corresponding variables are zipped into pairs and passed to it.
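The zipping and the per-pair update can be sketched with toy Python numbers. The learning rate and the gradient-descent rule here are assumptions of the sketch, not part of the Keras code being described:

```python
learning_rate = 0.5     # assumed hyperparameter for this sketch
grads = [1.0, -2.0]     # toy gradients
var_list = [2.0, 3.0]   # toy variable values

def apply_gradients(grads_and_vars):
    # Each (gradient, variable) pair is handled independently, which is
    # what lets the real implementation dispatch each update to the
    # device that holds the variable.
    return [var - learning_rate * grad for grad, var in grads_and_vars]

print(apply_gradients(zip(grads, var_list)))  # [1.5, 4.0]
```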
In .apply_gradients(), the variables are updated in a distributed fashion. The internal function update_var() is executed on each variable's device; it calls ._resource_apply_dense(), a method that subclasses override to update the variable values with the gradients.
(Source)
class OptimizerV2(Trackable):
    def minimize(self, loss, var_list, tape=None):
        # Use the tape passed in, or create a new one.
        tape = tape if tape is not None else tf.GradientTape()
        with tape:
            # Compute the gradients of the loss w.r.t. the variables.
            grads = tape.gradient(loss, var_list)
        self.apply_gradients(zip(grads, var_list))

    def apply_gradients(self, grads_and_vars):
        def update_var(var, grad):
            return self._resource_apply_dense(grad, var)

        strategy = tf.distribute.get_strategy()
        for grad, var in grads_and_vars:
            # Run the update on the device where the variable is placed.
            with strategy.extended.colocate_vars_with(var):
                strategy.extended.update(var, update_var, args=(grad,))

    def _resource_apply_dense(self, grad, var):
        # To be overridden by subclasses to do the actual variable update.
        raise NotImplementedError
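To make the tape's role concrete, here is a toy reverse-mode tape for scalars in plain Python. It only illustrates the recording idea behind tf.GradientTape; the real tape handles arbitrary tensor ops:

```python
class Var:
    """Toy scalar variable storing its value and accumulated gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class Tape:
    """Toy gradient tape: records a backward function for every op it
    executes, then replays them in reverse to compute gradients."""
    def __init__(self):
        self._backward = []  # backward functions, in forward order

    def mul(self, a, b):
        out = Var(a.value * b.value)
        def backward():
            a.grad += b.value * out.grad  # d(a*b)/da = b
            b.grad += a.value * out.grad  # d(a*b)/db = a
        self._backward.append(backward)
        return out

    def add(self, a, b):
        out = Var(a.value + b.value)
        def backward():
            a.grad += out.grad
            b.grad += out.grad
        self._backward.append(backward)
        return out

    def gradient(self, target, sources):
        target.grad = 1.0  # seed: d(target)/d(target) = 1
        for backward in reversed(self._backward):
            backward()
        return [source.grad for source in sources]

# Record loss = w * x + b on the tape, then query the gradients.
tape = Tape()
w, x, b = Var(3.0), Var(2.0), Var(1.0)
loss = tape.add(tape.mul(w, x), b)
grads = tape.gradient(loss, [w, b])
print(grads)  # [2.0, 1.0]
```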
TensorFlow API
tf.GradientTape()
Besides recording operations using a with statement, a tape can also be used in a stand-alone mode: the tape object that automatically recorded the gradients during the forward pass is passed around and later queried for the gradients.
TensorFlow API
tf.distribute.StrategyExtended
The strategy.extended in the code example above is actually an instance of StrategyExtended. All distribute strategies in TensorFlow have a .extended attribute. It exposes some device and locality control of the variables and tensors. For example, .colocate_vars_with(var) opens a scope where all the newly created variables will be on the same device as var. .update(var, update_var, args=(grad,)) runs update_var to update var, mirroring the args to the same device.
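The locality idea can be mimicked with a toy strategy object. All names here are hypothetical and no real device placement happens; the sketch only records where each update would have run:

```python
class ToyStrategyExtended:
    """Toy sketch of the StrategyExtended idea: it knows which device
    each variable lives on, and update() runs a function "on" that
    device (here we merely log the device name)."""
    def __init__(self):
        self._device_of = {}  # id(variable) -> device name
        self.update_log = []  # devices on which updates ran

    def place(self, var, device):
        self._device_of[id(var)] = device

    def update(self, var, fn, args=()):
        # Mirror the args to the variable's device and run fn there.
        self.update_log.append(self._device_of[id(var)])
        return fn(var, *args)

# A "variable" (a mutable cell here) placed on a device, then updated.
var = {"value": 2.0}
extended = ToyStrategyExtended()
extended.place(var, "GPU:0")

def update_var(v, grad):
    v["value"] -= 0.5 * grad  # toy learning rate of 0.5

extended.update(var, update_var, args=(1.0,))
print(var["value"], extended.update_log)  # 1.5 ['GPU:0']
```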
To make your own optimizer, you may need to override some of the functions, for
example, ._resource_apply_dense()
. Here is the pseudo-code for implementing
a stochastic gradient descent optimizer. We just override
._resource_apply_dense()
and call the corresponding TensorFlow operation to
update the variables.
(Source)
class SGD(OptimizerV2):
    def _resource_apply_dense(self, grad, var):
        # Apply var <- var - learning_rate * grad using the raw TensorFlow op.
        return tf.raw_ops.ResourceApplyGradientDescent(
            var=var.handle, alpha=self.learning_rate, delta=grad)
TensorFlow API
tf.raw_ops
The raw_ops module in TensorFlow is a collection of raw C++ TensorFlow ops for the user to use directly from Python. Each op is a tensor operation that corresponds to a kernel implemented in TensorFlow, for example, a GPU kernel. Please refer to this guide for more details about a TensorFlow op. It shows how to create a custom op.