Keras - Layers


A model is made up of multiple layers.

Activation Functions

The activation function determines the output of a layer's neurons. There are two ways to add an activation function to a layer: you can either pass it via the activation argument on any layer, or add the activation function as its own layer.

  • Argument:
model.add(keras.layers.Dense(32, activation='relu'))
  • As a layer:
model.add(keras.layers.Dense(32))
model.add(keras.layers.Activation('relu'))

Built-in activation functions:

  • relu
    • The ReLU or rectified linear unit activation function: max(x, 0)
    • The most common general-purpose activation function for hidden layers.
    • To clip, set an upper bound with the max_value argument; the threshold argument sets the value below which inputs are zeroed (see the sketch after this list).
  • sigmoid
    • \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
    • Values are between 0 and 1
    • S-Shaped
    • Great for making the last layer output a probability.
  • softmax
    • Converts a real vector to a vector of categorical probabilities.
    • Each output is the probability of its class. All outputs sum to 1.
    • Generally used as the activation function for the last layer of a multi-class classifier.
  • softplus
    • \( softplus(x) = log(e^x + 1) \)
  • softsign
    • \( softsign(x) = \frac{x}{\lvert x \rvert + 1} \)
  • tanh
    • Hyperbolic tangent. \( tanh(x) = \frac{sinh(x)}{cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
    • Outputs are between (-1, 1)
    • S-Shaped
    • Advantages:
      • Negative inputs are preserved (mapped to negative outputs)
      • Inputs near zero are mapped near zero
  • selu
    • The SELU or scaled exponential linear unit is related to the ReLU activation function and closely related to the Leaky ReLU activation function.
      • if x ≥ 0: return scale * x
      • if x < 0: return scale * alpha * (exp(x) - 1)
    • Unlike ReLU, SELU allows negative values, so neurons cannot die (get stuck outputting 0).
    • Compared to Leaky ReLU, negative values follow an exponential curve instead of a straight line.
  • elu
    • Exponential Linear Unit. SELU, but without the scale.
  • exponential
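
A minimal sketch of the relu clipping arguments mentioned above (the input values here are arbitrary):

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 2.0, 10.0])
# Default ReLU: negatives become 0
print(tf.keras.activations.relu(x).numpy())                                # [0. 0. 0. 2. 10.]
# max_value caps the output; threshold zeroes inputs below it (with alpha=0)
print(tf.keras.activations.relu(x, max_value=6.0, threshold=1.0).numpy())  # [0. 0. 0. 2. 6.]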

Advanced Activation Functions:

  • LeakyReLU
    • tf.keras.layers.LeakyReLU(alpha=0.3)
    • Like ELU, but negative values follow a straight line with a small slope (alpha) instead of an exponential curve (see the sketch below).
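
A minimal sketch of LeakyReLU, which is added as its own layer rather than through the activation argument (layer sizes are arbitrary):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(10,)),   # no activation here
    keras.layers.LeakyReLU(alpha=0.1),           # negative inputs get a slope of 0.1
    keras.layers.Dense(1, activation='sigmoid'),
])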

Pros and cons of most activation functions: HERE.

Attention Layers

Base Layer Class

All layers inherit from the base layer class.

tf.keras.layers.Layer(
    trainable=True, name=None, dtype=None, dynamic=False, **kwargs   
)
  • Trainable: If false, the layer is frozen and will not be trained.
  • Name: Names the layer; can be accessed via model.layers[0].name
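
A small sketch of the trainable and name arguments (layer sizes are arbitrary):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, name="hidden", input_shape=(4,)),
    keras.layers.Dense(1, name="output"),
])
model.layers[0].trainable = False      # freeze the first layer
print(model.layers[0].name)            # "hidden"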

Convolutional Layers

Convolutional layers are used in convolutional neural networks (CNNs).

Conv*D

There are three different convolutional layer dimensions:

  • Conv1D: Ex: temporal convolutions
  • Conv2D: Ex: spatial convolutions over images
  • Conv3D: Ex: spatial convolutions over volumes
tf.keras.layers.Conv2D(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    data_format=None,
    dilation_rate=(1, 1),
    groups=1,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
  • Filters: Number of output filters (the number of channels produced by the convolution)
  • Kernel Size: Height and width of the 2D convolution window. An int or tuple of 2 ints
  • Strides: Specifies the strides (movement of the window) along the height and width.
  • Padding:
    • 'valid': No padding. Output shape will be smaller than input shape
    • 'same': Adds padding. Output shape (height/width, not channels/filters) will be the same as the input
  • data_format
    • 'channels_last' (default): Expects (batch_size, height, width, channels) as input
    • 'channels_first': Expects (batch_size, channels, height, width) as input

Input Shape: batch_shape + (rows, cols, channels) with the default channels_last data format, or batch_shape + (channels, rows, cols) if channels_first.

Output Shape: batch_shape + (new_rows, new_cols, filters) with channels_last, or batch_shape + (filters, new_rows, new_cols) if channels_first. Rows and columns may change due to padding.
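
A minimal sketch (using the default channels_last format) showing how padding changes the output shape; the input size and filter count are arbitrary:

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 28, 28, 3))                    # (batch, rows, cols, channels)
valid = keras.layers.Conv2D(8, 3, padding="valid")(x)
same = keras.layers.Conv2D(8, 3, padding="same")(x)
print(valid.shape)   # (1, 26, 26, 8) -- rows/cols shrink, channels become the filter count
print(same.shape)    # (1, 28, 28, 8) -- rows/cols preserved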

SeparableConv*D

Similar to Conv*D layers, but channels are kept separate at first and then mixed at the end. The 2D version is similar to an Inception Block.

DepthwiseConv2D

Performs the first half of SeparableConv2D, where channels are kept separate.
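
A small sketch comparing parameter counts, which is the main practical benefit of separable convolutions (the input shape and filter count are arbitrary):

import tensorflow as tf
from tensorflow import keras

inp = keras.Input(shape=(32, 32, 64))
conv = keras.layers.Conv2D(128, 3)(inp)
sep = keras.layers.SeparableConv2D(128, 3)(inp)
print(keras.Model(inp, conv).count_params())   # 73856 = 3*3*64*128 + 128
print(keras.Model(inp, sep).count_params())    # 8896 = 3*3*64 + 64*128 + 128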

Conv*DTranspose (Deconvolution)

"Undoes" a convolutional layer. Is generally used to increase the dimensionality (rows and columns) while decreasing the channel number.

tf.keras.layers.Conv2DTranspose(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    output_padding=None,
    data_format=None,
    dilation_rate=(1, 1),
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
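
A small sketch showing how a transposed convolution with stride 2 roughly doubles the spatial dimensions (shapes are arbitrary):

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 7, 7, 64))
up = keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same")(x)
print(up.shape)   # (1, 14, 14, 32) -- rows/cols doubled, channels reduced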

Core Layers

Input

Used to instantiate a Keras tensor.

tf.keras.Input(
    shape=None,
    batch_size=None,
    name=None,
    dtype=None,
    sparse=False,
    tensor=None,
    ragged=False,
    **kwargs
)
  • Shape: Input shape, not including batch size. Should be a tuple of integers.
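
A minimal functional-API sketch; Input creates the symbolic tensor the rest of the model is built on (the shapes here are arbitrary):

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(784,))                 # batch size is left out
outputs = keras.layers.Dense(10, activation="softmax")(inputs)
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
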
Dense

The most common layer type. A fully connected layer: every neuron is connected to every output of the previous layer.

tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
  • Units: The number of neurons
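
A small sketch: a Dense layer with 32 units maps an input of any size n to an output of size 32 (its kernel has shape (n, 32)); the sizes below are arbitrary:

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(32)
y = layer(tf.random.normal((4, 16)))     # 4 samples with 16 features each
print(y.shape)                           # (4, 32)
print(layer.kernel.shape)                # (16, 32) -- weights are created on the first call
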
Activation

Add an activation function to the previous layer.

tf.keras.layers.Activation(activation, **kwargs)
Embedding

THIS site does a great job of explaining what an embedding layer does: "an embedding layer tries to find the optimal mapping of each of the unique words to a vector of real numbers. The size of that vector is equal to the output_dim". An embedding layer maps a vector of word indices (a small sample of the vocabulary) to feature vectors.

Must be the first layer of a model.

tf.keras.layers.Embedding(
    input_dim,
    output_dim,
    embeddings_initializer="uniform",
    embeddings_regularizer=None,
    activity_regularizer=None,
    embeddings_constraint=None,
    mask_zero=False,
    input_length=None,
    **kwargs
)
  • Input Dim: Vocabulary size, number of possible unique words in an input vector.
  • Output Dim: Dimension of the dense embedding (size of the feature vector for each unique word)
  • Input Length: Use if the input is of a constant length. Required if using Flatten followed by Dense later on.
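
A minimal sketch, assuming a vocabulary of 1000 words, sequences of length 10, and 64-dimensional embeddings (all arbitrary choices):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=1000, output_dim=64, input_length=10),
    keras.layers.Flatten(),              # requires input_length to be set
    keras.layers.Dense(1, activation="sigmoid"),
])
x = tf.random.uniform((32, 10), maxval=1000, dtype=tf.int32)   # 32 sequences of word ids
print(model(x).shape)                    # (32, 1)
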
Masking

Used primarily with RNNs. Skips timesteps whose features all equal mask_value, which is useful for skipping padding when using an LSTM.

tf.keras.layers.Masking(mask_value=0.0, **kwargs)
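
A small sketch: timesteps whose features all equal mask_value are skipped by the LSTM (the shapes are arbitrary):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Masking(mask_value=0.0, input_shape=(5, 8)),   # 5 timesteps, 8 features
    keras.layers.LSTM(16),               # all-zero (padded) timesteps are ignored
])
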
Lambda
tf.keras.layers.Lambda(
    function, output_shape=None, mask=None, arguments=None, **kwargs
)
  • Function: The function to wrap; it receives the input tensor as its first argument.
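
A minimal sketch wrapping an arbitrary function in a layer:

import tensorflow as tf
from tensorflow import keras

scale = keras.layers.Lambda(lambda x: x * 2.0)      # any simple, stateless tensor function
print(scale(tf.constant([1.0, 2.0, 3.0])).numpy())  # [2. 4. 6.]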

Locally Connected Layers

Merging Layers

Normalization Layers

Pooling Layers

Pooling layers are used to downsample. They are generally used with convolutional layers to reduce the size of the feature space

MaxPooling*D

Max pooling passes the max value over a window to the next layer. There are three different pooling layer dimensions:

  • MaxPooling1D: Ex: temporal data
  • MaxPooling2D: Ex: spatial data (images)
  • MaxPooling3D: Ex: 3D data (spatial or spatio-temporal)
tf.keras.layers.MaxPooling2D(
    pool_size=(2, 2), 
    strides=None, 
    padding="valid", 
    data_format=None, 
    **kwargs
)
  • Pool Size: Size of the window
  • Strides: How far the window moves after each pooling step (int or tuple of ints)
  • Padding:
    • 'valid': No padding. output_shape = floor((input_shape - pool_size) / strides) + 1
    • 'same': Pads so that output_shape = ceil(input_shape / strides) (the same height/width as the input when strides = 1)
  • Data Format: 'channels_last' or 'channels_first'
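
A small sketch of the downsampling with the default 2x2 window (the input shape is arbitrary):

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 28, 28, 8))
pooled = keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
print(pooled.shape)    # (1, 14, 14, 8) -- rows/cols halved, channels unchanged
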
AveragePooling*D

Average pooling passes the average value over a window to the next layer. Like max pooling, it comes in 1D, 2D, and 3D versions:

tf.keras.layers.AveragePooling2D(
    pool_size=(2, 2), strides=None, padding="valid", data_format=None, **kwargs
)

Args same as MaxPooling

Other

There are also GlobalMaxPooling and GlobalAveragePooling variants that pool over the entire input rather than a window.
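
A small sketch: global pooling collapses each feature map to a single value, a common way to go from convolutional feature maps to a Dense classifier (the input shape is arbitrary):

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 7, 7, 256))
print(keras.layers.GlobalAveragePooling2D()(x).shape)   # (1, 256)
print(keras.layers.GlobalMaxPooling2D()(x).shape)       # (1, 256)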

Preprocessing Layers

Recurrent Layers

Reshaping Layers

Weight Constraints

Constraints can be added to the weights of a layer. For example, a constraint might disallow negative weights, or it might limit the norm of the layer's weights.
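
A minimal sketch using two of the built-in constraints (the values are arbitrary):

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(
    64,
    kernel_constraint=keras.constraints.MaxNorm(max_value=2.0),   # limit the norm of the weights
    bias_constraint=keras.constraints.NonNeg(),                   # disallow negative biases
)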

Weight Initializers

Weight initializers set the initial values of a layer's weights.
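
A minimal sketch; glorot_uniform is the default for Dense layers, and the initializers used here are also built in:

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(
    64,
    kernel_initializer=keras.initializers.HeNormal(),     # often paired with ReLU-family activations
    bias_initializer=keras.initializers.Zeros(),
)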

Weight Regularizers

Weight regularizers penalize certain aspects of a layer's parameters during optimization (training).

Three common regularizer arguments exist for most layer types:

  • kernel_regularizer: Applies regularization function to the weights matrix
  • bias_regularizer: Applies regularization function to the bias
  • activity_regularizer: Applies regularization function to the output of the layer

Good stackexchange post about the three regularizers.

There are three available regularizers:

  • tf.keras.regularizers.l1(l1=0.01): loss = l1 * reduce_sum(abs(x))
  • tf.keras.regularizers.l2(l2=0.01): loss = l2 * reduce_sum(square(x))
  • tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01)

For example:

import tensorflow as tf
from tensorflow.keras import regularizers

layer = tf.keras.layers.Dense(
    units=64,
    kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
    bias_regularizer=regularizers.l2(1e-4),
    activity_regularizer=regularizers.l2(1e-5)
)