Keras

Keras is a high-level deep learning API that runs on machine learning platforms such as TensorFlow.

Docs: https://keras.io/api/

Callbacks

Layers

A model is made up of multiple layers.

• Activation Functions

The activation function determines what the neurons of a layer output. There are two ways to add an activation function to a layer: via the activation argument on any layer, or as a standalone Activation layer.

  • Argument:
model.add(keras.layers.Dense(32, activation='relu'))
  • As a layer:
model.add(keras.layers.Dense(32))
model.add(keras.layers.Activation('relu'))

Built-in activation functions:

  • relu
    • The ReLU or rectified linear unit activation function: max(x, 0)
    • The generic, go-to activation function.
    • To clip, set the maximum with the max_value argument or zero out values below a cutoff with the threshold argument (see the sketch after this list).
  • sigmoid
    • \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
    • Values are between 0 and 1
    • S-Shaped
    • Great for making the last layer output a probability.
  • softmax
    • Converts a real vector to a vector of categorical probabilities.
    • Each output is the probability of that output. All outputs sum to 1.
    • Generally used as the activation function for the last layer.
  • softplus
    • \( softplus(x) = log(e^x + 1) \)
  • softsign
    • \( softsign(x) = \frac{x}{\lvert x \rvert + 1} \)
  • tanh
    • Hyperbolic tangent. \( tanh(x) = \frac{sinh(x)}{cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
    • Outputs are between (-1, 1)
    • S-Shaped
    • Advantages: negative numbers are preserved, and zero inputs are mapped near zero.
  • selu
    • The SELU or scaled exponential linear unit is related to the ReLU activation function and closely related to the Leaky ReLU activation function.
      • if x > 0: return scale * x
      • if x < 0: return scale * alpha * (exp(x) - 1)
    • Unlike ReLU, SELU allows negative values, so neurons cannot die (get stuck outputting 0).
    • Compared to leaky ReLU, there is an exponential dip instead of a straight line for negative values.
  • elu
    • Exponential Linear Unit. Like SELU, but without the scale factor.
  • exponential
    • The exponential activation function: \( e^x \)
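
A quick sketch of the relu clipping arguments mentioned above (the input values are made up):

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 2.0, 10.0])
tf.keras.activations.relu(x)                 # [0, 0, 0, 2, 10]
tf.keras.activations.relu(x, max_value=5.0)  # [0, 0, 0, 2, 5] - clipped at 5
tf.keras.activations.relu(x, threshold=1.0)  # [0, 0, 0, 2, 10] - values below 1 are zeroed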

Advanced Activation Functions:

  • LeakyReLU
    • tf.keras.layers.LeakyReLU(alpha=0.3)
    • Like ELU, but negative values follow a straight line with a small slope (alpha) instead of an exponential curve.

• Attention Layers

• Base Layer Class

All layers inherit from the base layer class.

tf.keras.layers.Layer(
    trainable=True, name=None, dtype=None, dynamic=False, **kwargs   
)
  • Trainable: If False, the layer is frozen and will not be trained.
  • Name: Names the layer. Can be accessed via model.layers[0].name
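
A minimal sketch of subclassing the base class to build a custom layer (the MyScale layer and its behavior are made up for illustration):

import tensorflow as tf

class MyScale(tf.keras.layers.Layer):
    """Hypothetical layer that multiplies its input by one learned scalar."""

    def build(self, input_shape):
        # Weights are created here, once the input shape is known
        self.scale = self.add_weight(name='scale', shape=(), initializer='ones', trainable=True)

    def call(self, inputs):
        return inputs * self.scale

layer = MyScale(name='my_scale')  # name is the base class argument from above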

• Convolutional Layers

Convolutional layers are used in convolutional neural networks (CNNs).

Conv*D

There are three different convolutional layer dimensions:

  • Conv1D: e.g. temporal convolutions
  • Conv2D: e.g. spatial convolutions over images
  • Conv3D: e.g. spatial convolutions over volumes
tf.keras.layers.Conv2D(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    data_format=None,
    dilation_rate=(1, 1),
    groups=1,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
  • Filters: Number of output filters
  • Kernel Size: Height and width of the 2D convolution window. An int or tuple of 2 ints
  • Strides: Specifies the strides (movement of the window) along the height and width.
  • Padding:
    • 'valid': No padding. Output shape will be smaller than input shape
    • 'same': Adds padding. Output shape (height/width, not channels/filters) will be the same as the input
  • data_format
    • 'channels_last' (default): Expects (batch_size, height, width, channels) as input
    • 'channels_first': Expects (batch_size, channels, height, width) as input

Input Shape: batch_shape + (channels, rows, cols) if data format is channels first, or batch_shape + (rows, cols, channels) if channels last.

Output Shape: batch_shape + (filters, new_rows, new_cols) if data format is channels first, or batch_shape + (new_rows, new_cols, filters) if channels last. Rows and columns may change due to padding.
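
A sketch of how padding changes the output shape (input sizes are made up; shapes assume the default channels_last):

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 3))  # (batch, rows, cols, channels)
y = tf.keras.layers.Conv2D(16, kernel_size=(3, 3), padding='valid')(x)
print(y.shape)  # (1, 26, 26, 16) - rows/cols shrink, channels become the filter count
y = tf.keras.layers.Conv2D(16, kernel_size=(3, 3), padding='same')(x)
print(y.shape)  # (1, 28, 28, 16) - rows/cols preserved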

SeparableConv*D

Similar to Conv*D layers, but channels are kept separate at first (a depthwise convolution) and then mixed at the end (a pointwise convolution). The 2D version is similar to an Inception block.

DepthwiseConv2D

Performs the first half of SeparableConv2D, where channels are kept separate.

Conv*DTranspose (Deconvolution)

"Undoes" a convolutional layer. Is generally used to increase the dimensionality (rows and columns) while decreasing the channel number.

tf.keras.layers.Conv2DTranspose(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    output_padding=None,
    data_format=None,
    dilation_rate=(1, 1),
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
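
A sketch of the shape effect (sizes made up): with strides of 2 and 'same' padding, rows and columns double while the filter count drops.

import tensorflow as tf

x = tf.random.normal((1, 14, 14, 32))
y = tf.keras.layers.Conv2DTranspose(16, kernel_size=(3, 3), strides=(2, 2), padding='same')(x)
print(y.shape)  # (1, 28, 28, 16) - rows/cols doubled, channels reduced from 32 to 16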

• Core Layers

Input

Used to instantiate a Keras tensor.

tf.keras.Input(
    shape=None,
    batch_size=None,
    name=None,
    dtype=None,
    sparse=False,
    tensor=None,
    ragged=False,
    **kwargs
)
  • Shape: Input shape, not including batch size. Should be a tuple of integers.
Dense

The most common layer type. A layer that is fully connected to the previous layer.

tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
  • Units: The number of neurons
Activation

Add an activation function to the previous layer.

tf.keras.layers.Activation(activation, **kwargs)
Embedding

An embedding layer tries to find the optimal mapping of each unique word (an integer index) to a vector of real numbers; the size of that vector is equal to output_dim. In other words, it maps a vector of word indices drawn from the vocabulary to a sequence of dense feature vectors.

Must be the first layer of a model.

tf.keras.layers.Embedding(
    input_dim,
    output_dim,
    embeddings_initializer="uniform",
    embeddings_regularizer=None,
    activity_regularizer=None,
    embeddings_constraint=None,
    mask_zero=False,
    input_length=None,
    **kwargs
)
  • Input Dim: Vocabulary size, number of possible unique words in an input vector.
  • Output Dim: Dimension of the dense embedding (size of the feature vector for each unique word)
  • Input Length: Use if the input is of a constant length. Required if using Flatten followed by Dense later on.
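
A sketch of the mapping (vocabulary size, vector size, and word indices are made up):

import tensorflow as tf

# 1000 possible words, each mapped to an 8-dimensional feature vector
layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=8, input_length=4)
ids = tf.constant([[3, 41, 7, 0]])  # one sample of 4 word indices
print(layer(ids).shape)  # (1, 4, 8) - one feature vector per word
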
Masking

Used primarily in RNNs. Skips timesteps where every feature equals mask_value. Good for skipping padding when using an LSTM (see the sketch below the signature).

tf.keras.layers.Masking(mask_value=0.0, **kwargs)
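
A minimal sketch of masking padded timesteps before an LSTM (the shapes are made up):

import tensorflow as tf

model = tf.keras.Sequential([
    # Timesteps where all 4 features equal 0.0 are skipped by downstream layers
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(10, 4)),
    tf.keras.layers.LSTM(8),
])
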
Lambda
tf.keras.layers.Lambda(
    function, output_shape=None, mask=None, arguments=None, **kwargs
)
  • Function: The function to be evaluated (often a lambda)
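
For example, a sketch that wraps an arbitrary function as a layer (the doubling function is made up):

import tensorflow as tf

double = tf.keras.layers.Lambda(lambda x: x * 2.0)
print(double(tf.constant([1.0, 2.0])))  # [2.0, 4.0]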

• Locally Connected Layers

• Merging Layers

• Normalization Layers

• Pooling Layers

Pooling layers are used to downsample. They are generally used with convolutional layers to reduce the size of the feature space.

MaxPooling*D

Max pooling passes the max value over a window to the next layer. There are three different pooling layer dimensions:

  • MaxPooling1D: e.g. temporal data
  • MaxPooling2D: e.g. spatial data (images)
  • MaxPooling3D: e.g. 3D data (spatial or spatio-temporal)
tf.keras.layers.MaxPooling2D(
    pool_size=(2, 2), 
    strides=None, 
    padding="valid", 
    data_format=None, 
    **kwargs
)
  • Pool Size: Size of the window
  • Strides: How far the window moves after each pooling step (int or tuple of ints)
  • Padding:
    • 'valid': No padding. output_shape = floor((input_shape - pool_size) / strides) + 1
    • 'same': Output has the same height/width as the input when strides=1. output_shape = ceil(input_shape / strides)
  • Data Format: 'channels_last' or 'channels_first'
AveragePooling*D

Average pooling passes the average value over a window to the next layer.

tf.keras.layers.AveragePooling2D(
    pool_size=(2, 2), strides=None, padding="valid", data_format=None, **kwargs
)

Args same as MaxPooling

Other

There are also GlobalMaxPooling and GlobalAveragePooling variants that don't use a window but pool over the entire input, as sketched below.
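
A sketch contrasting windowed and global pooling (input shape made up):

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 16))
print(tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x).shape)  # (1, 14, 14, 16)
print(tf.keras.layers.GlobalAveragePooling2D()(x).shape)        # (1, 16) - one value per channel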

• Preprocessing Layers

• Recurrent Layers

• Reshaping Layers

• Weight Constraints

Constraints can be added to the weights of a layer. For example, a constraint might disallow negative weights or limit the norm of the weights.
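
For example (MaxNorm and NonNeg are built-in constraints; the Dense layer is just for illustration):

import tensorflow as tf

layer = tf.keras.layers.Dense(
    units=64,
    kernel_constraint=tf.keras.constraints.MaxNorm(max_value=2.0),  # limit the norm of the weights
    bias_constraint=tf.keras.constraints.NonNeg(),                  # disallow negative biases
)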

• Weight Initializers

Weight initializers set the initial values of a layer's weights.
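
For example (HeNormal and zeros are built-in initializers; the layer is illustrative):

import tensorflow as tf

layer = tf.keras.layers.Dense(
    units=64,
    kernel_initializer=tf.keras.initializers.HeNormal(),  # a common choice for relu layers
    bias_initializer='zeros',                             # string shorthand also works
)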

• Weight Regularizers

Weight regularizers penalize certain aspects of a layer's parameters during optimization (training).

Three common regularizer arguments exist for most layer types:

  • kernel_regularizer: Applies regularization function to the weights matrix
  • bias_regularizer: Applies regularization function to the bias
  • activity_regularizer: Applies regularization function to the output of the layer

There are three available regularizers:

  • tf.keras.regularizers.l1(l1=0.01): loss = l1 * reduce_sum(abs(x))
  • tf.keras.regularizers.l2(l2=0.01): loss = l2 * reduce_sum(square(x))
  • tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01): applies both penalties

For example:

layer = tf.keras.layers.Dense(
    units=64,
    kernel_regularizer=tf.keras.regularizers.l1_l2(l1=1e-5, l2=1e-4),
    bias_regularizer=tf.keras.regularizers.l2(1e-4),
    activity_regularizer=tf.keras.regularizers.l2(1e-5)
)

Models

A model in Keras is just a group of layers.

A complete example that goes through creating, training, and evaluating a keras model:

from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import shuffle
import keras

# Load dataset
iris = load_iris()
data = iris.data
enc = LabelBinarizer()
target = enc.fit_transform(iris.target)
X, y = shuffle(data, target, random_state=0)

# Make model
inputs = keras.Input(shape=(4,))
x = keras.layers.Dense(5, activation='relu')(inputs)
outputs = keras.layers.Dense(3, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=outputs)

# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train
model.fit(x=X, y=y, batch_size=8, epochs=150, validation_split=0.3)

# Evaluate
loss, accuracy = model.evaluate(X, y)

# Predict
predictions = model.predict(X)  # softmax gives us a probability of each category

• Model

The Model class requires two things: the inputs to the model and the outputs of the model. There is an optional name parameter.

There are two ways to instantiate a Model class:

  • Functional API
  • By subclassing the Model class

Functional API method:

import tensorflow as tf
inputs = tf.keras.Input(shape=(3,))
x = tf.keras.layers.Dense(4, activation=tf.nn.relu)(inputs)
outputs = tf.keras.layers.Dense(5, activation=tf.nn.softmax)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
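
Subclassing method (a sketch equivalent to the functional model above):

import tensorflow as tf

class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(5, activation=tf.nn.softmax)

    def call(self, inputs):
        return self.dense2(self.dense1(inputs))

model = MyModel()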

• Save And Load

Save tf.keras

Saving a model whole is super easy.

model.save('my_model')

This will create a directory called my_model with assets, saved_model.pb, and variables as contents. This only works if using tf.keras instead of native keras. See below if using native keras.

To load it again:

model = tf.keras.models.load_model('my_model')
Save keras

Saving the model as a single HDF5 file is an option; however, some items are not saved, such as custom layers and external losses and metrics. These can be quite annoying to add back later if you get a model from someone else.

H5:

model.save('model.h5')
model = tf.keras.models.load_model('model.h5')
Other Save/Load Functions
  • model.get_weights()
  • model.set_weights(weights)
  • model.save_weights('file_path.h5')
  • model.load_weights('file_path.h5')
  • model.to_json()
  • model = tf.keras.models.model_from_json(config)
  • new_model = tf.keras.models.clone_model(model)

• Sequential

The Sequential class allows you to add layers sequentially to a model.

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(16,)))  # Add input layer that accepts a feature vector of length 16
model.add(tf.keras.layers.Dense(6))  # Adds a layer containing 6 neurons

• Summary

Model.summary() can be used to summarize your model by outputting its layers, output shapes, and parameter counts:

Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 3)]               0
_________________________________________________________________
dense (Dense)                (None, 4)                 16
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 25
=================================================================
Total params: 41
Trainable params: 41
Non-trainable params: 0
_________________________________________________________________

• Training

Keras training APIs involve compiling, fitting, evaluating, and predicting using a model.

Compile

Prepares the model for training (configures the optimizer, loss, and metrics, plus a lot of hidden setup).

Model.compile(
    optimizer="rmsprop",
    loss=None,
    metrics=None,
    loss_weights=None,
    weighted_metrics=None,
    run_eagerly=None,
    steps_per_execution=None,
    **kwargs
)
  • Optimizer: Adam is the most popular optimizer.
  • Loss: The neural network will try to minimize this value via the optimization algorithm. There are a large number of loss functions. Note that in Keras you can pass a snake-cased string instead of the TensorFlow loss object (see the sketch after this list). Most notably:
    • Classification (Categorical) Data:
      • BinaryCrossentropy: Used when there are only two possible labels (0 and 1)
      • CategoricalCrossentropy: Used when there are two+ possible classes. Expects labels to be encoded via one-hot representation.
      • SparseCategoricalCrossentropy: Sibling to CategoricalCrossentropy. Expects an integer encoding instead of one-hot. Integers are distinct classes, similarity via closeness is not assumed.
    • Regression (Continuous) Data:
      • MeanSquaredError: Mean of the squared errors; the usual default, penalizes large errors heavily.
      • MeanAbsoluteError: Mean of the absolute errors; less sensitive to outliers than MSE.
  • Metrics: List of metrics to output during training and returned during fitting. ['accuracy'] is the most common metric
  • Loss Weights: If a list of losses is given as the loss function, you can specify how heavily weighted each loss is. For example, [10, 1] would weight the first loss function 10 times heavier than the second loss function.
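
For example, these two compile calls are equivalent (string shorthand vs. loss/optimizer objects; assumes a built model and import tensorflow as tf from context):

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy'],
)
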
Fit

Used to train a model.

Model.fit(
    x=None,
    y=None,
    batch_size=None,
    epochs=1,
    verbose=1,
    callbacks=None,
    validation_split=0.0,
    validation_data=None,
    shuffle=True,
    class_weight=None,
    sample_weight=None,
    initial_epoch=0,
    steps_per_epoch=None,
    validation_steps=None,
    validation_batch_size=None,
    validation_freq=1,
    max_queue_size=10,
    workers=1,
    use_multiprocessing=False,
)
  • Verbose: 0 = silent, 1 = progress bar per epoch, 2 = one line output per epoch
  • Callbacks: List of callbacks to apply during training (see the Callbacks section above).
  • Validation Split: Float between 0 and 1. Fraction of the training data to use for validation. Taken from the end of the data BEFORE shuffling.
  • Validation Data: Data to use for validation. Data should be in (x_val, y_val) format. Do not use with validation_split.
  • Class Weight: Dictionary mapping class indices (integers) to a weight (float). Useful for unbalanced data, where there are more samples of one class than another (see the sketch after this list).
  • Sample Weight: Weigh samples differently. Expects a 1D NumPy array with one weight per sample.
  • Validation Frequency: How often, in terms of epochs, to run validation.
  • Generator Specific Arguments:
    • Steps Per Epoch: Number of batches required to declare an epoch. Needed for generators. Same idea for validation_steps.
    • Max Queue Size: Number of samples to queue for a generator. Defaults to 10.
    • Workers: Number of workers used for generators. Defaults to 1.
    • Use Multiprocessing: Use process-based threading for generators.
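
A sketch of class_weight for an unbalanced binary problem (the 10x factor is made up; model, X, and y are from the complete example above):

# Errors on class 1 count 10 times as much as errors on class 0
model.fit(x=X, y=y, epochs=10, class_weight={0: 1.0, 1: 10.0})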

Returns a History object that can be used for plotting.
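
For example, a sketch of plotting the training curves (assumes matplotlib is installed and the model/data from the complete example above):

import matplotlib.pyplot as plt

history = model.fit(x=X, y=y, batch_size=8, epochs=150, validation_split=0.3)
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.show()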

Evaluate

Like fit, but without the training. Used to find the loss and metric values for the model.

Model.evaluate(
    x=None,
    y=None,
    batch_size=None,
    verbose=1,
    sample_weight=None,
    steps=None,
    callbacks=None,
    max_queue_size=10,
    workers=1,
    use_multiprocessing=False,
    return_dict=False,
)
  • Return Dict: Return the loss and metric results as a dictionary instead of a list. Key is the name of the metric. If False, a list (or single value) is returned.
Predict

Used to generate predictions for input samples. A batch is expected.

Model.predict(
    x,
    batch_size=None,
    verbose=0,
    steps=None,
    callbacks=None,
    max_queue_size=10,
    workers=1,
    use_multiprocessing=False,
)

A NumPy array of predictions is returned.
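
Since softmax outputs one probability per class, a common follow-up is argmax (a sketch, reusing the model and X from the complete example above):

import numpy as np

predictions = model.predict(X)           # shape: (num_samples, num_classes)
labels = np.argmax(predictions, axis=1)  # most probable class index per sample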