Keras - Layers


A model is made up of multiple layers.

Activation Functions

The activation function determines the output of a layer's neurons. There are two ways to add an activation function to a layer: you can either pass it via the activation argument on any layer, or add the activation function as its own layer.

  • Argument:
model.add(keras.layers.Dense(32, activation='relu'))
  • As a layer:
model.add(keras.layers.Dense(32))
model.add(keras.layers.Activation('relu'))

Built-in activation functions:

  • relu
    • The ReLU or rectified linear unit activation function: max(x, 0)
    • The most common general-purpose activation function for hidden layers.
    • To clip, set an upper bound with the max_value argument; the threshold argument sets the value below which inputs are zeroed (see the sketch after this list).
  • sigmoid
    • \( \sigma(x) = \frac{1}{1 + e^{-x}} \)
    • Values are between 0 and 1
    • S-Shaped
    • Great for making the last layer output a probability.
  • softmax
    • Converts a real vector to a vector of categorical probabilities.
    • Each output is the probability of its class. All outputs sum to 1.
    • Generally used as the activation function for the last layer of a multi-class classifier.
  • softplus
    • \( softplus(x) = log(e^x + 1) \)
  • softsign
    • \( softsign(x) = \frac{x}{\lvert x \rvert + 1} \)
  • tanh
    • Hyperbolic tangent. \( tanh(x) = \frac{sinh(x)}{cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
    • Outputs are between (-1, 1)
    • S-Shaped
    • Advantages:
      • Negative inputs are preserved (mapped to negative outputs)
      • Inputs near zero are mapped near zero
  • selu
    • The SELU or scaled exponential linear unit is related to the ReLU activation function and closely related to the Leaky ReLU activation function.
      • if x ≥ 0: return scale * x
      • if x < 0: return scale * alpha * (exp(x) - 1)
    • Unlike ReLU, SELU allows negative values, so neurons cannot die (get stuck outputting 0).
    • Compared to Leaky ReLU, negative values follow an exponential curve instead of a straight line.
  • elu
    • Exponential Linear Unit. SELU, but without the scale.
  • exponential
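
A minimal sketch of the relu clipping arguments mentioned above (the input values here are arbitrary):

import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 2.0, 10.0])
# Default ReLU: negatives become 0
print(tf.keras.activations.relu(x).numpy())                                # [0. 0. 0. 2. 10.]
# max_value caps the output; threshold zeroes inputs below it (with alpha=0)
print(tf.keras.activations.relu(x, max_value=6.0, threshold=1.0).numpy())  # [0. 0. 0. 2. 6.]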

Advanced Activation Functions:

  • LeakyReLU
    • tf.keras.layers.LeakyReLU(alpha=0.3)
    • Like ELU, but negative values follow a straight line with a small slope (alpha) instead of an exponential curve (see the sketch below).
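
A minimal sketch of LeakyReLU, which is added as its own layer rather than through the activation argument (layer sizes are arbitrary):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(10,)),   # no activation here
    keras.layers.LeakyReLU(alpha=0.1),           # negative inputs get a slope of 0.1
    keras.layers.Dense(1, activation='sigmoid'),
])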

Pros and cons of most activation functions: HERE.

Attention Layers

Base Layer Class

All layers inherit from the base layer class.

tf.keras.layers.Layer(
    trainable=True, name=None, dtype=None, dynamic=False, **kwargs   
)
  • Trainable: If false, the layer is frozen and will not be trained.
  • Name: Names the layer; can be accessed via model.layers[0].name
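
A small sketch of the trainable and name arguments (layer sizes are arbitrary):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, name="hidden", input_shape=(4,)),
    keras.layers.Dense(1, name="output"),
])
model.layers[0].trainable = False      # freeze the first layer
print(model.layers[0].name)            # "hidden"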

Convolutional Layers

Convolutional layers are used in convolutional neural networks (CNNs).

Conv*D

There are three different convolutional layer dimensions:

  • Conv1D: Ex: temporal convolutions
  • Conv2D: Ex: spatial convolutions over images
  • Conv3D: Ex: spatial convolutions over volumes
tf.keras.layers.Conv2D(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    data_format=None,
    dilation_rate=(1, 1),
    groups=1,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
  • Filters: Number of output filters (the number of channels produced by the convolution)
  • Kernel Size: Height and width of the 2D convolution window. An int or tuple of 2 ints
  • Strides: Specifies the strides (movement of the window) along the height and width.
  • Padding:
    • 'valid': No padding. Output shape will be smaller than input shape
    • 'same': Adds padding. Output shape (height/width, not channels/filters) will be the same as the input
  • data_format
    • 'channels_last' (default): Expects (batch_size, height, width, channels) as input
    • 'channels_first': Expects (batch_size, channels, height, width) as input

Input Shape: batch_shape + (rows, cols, channels) with the default channels_last data format, or batch_shape + (channels, rows, cols) if channels_first.

Output Shape: batch_shape + (new_rows, new_cols, filters) with channels_last, or batch_shape + (filters, new_rows, new_cols) if channels_first. Rows and columns may change due to padding.
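
A minimal sketch (using the default channels_last format) showing how padding changes the output shape; the input size and filter count are arbitrary:

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 28, 28, 3))                    # (batch, rows, cols, channels)
valid = keras.layers.Conv2D(8, 3, padding="valid")(x)
same = keras.layers.Conv2D(8, 3, padding="same")(x)
print(valid.shape)   # (1, 26, 26, 8) -- rows/cols shrink, channels become the filter count
print(same.shape)    # (1, 28, 28, 8) -- rows/cols preserved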

SeparableConv*D

Similar to Conv*D layers, but channels are kept separate at first and then mixed at the end. The 2D version is similar to an Inception Block.

DepthwiseConv2D

Performs the first half of SeparableConv2D, where channels are kept separate.
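
A small sketch comparing parameter counts, which is the main practical benefit of separable convolutions (the input shape and filter count are arbitrary):

import tensorflow as tf
from tensorflow import keras

inp = keras.Input(shape=(32, 32, 64))
conv = keras.layers.Conv2D(128, 3)(inp)
sep = keras.layers.SeparableConv2D(128, 3)(inp)
print(keras.Model(inp, conv).count_params())   # 73856 = 3*3*64*128 + 128
print(keras.Model(inp, sep).count_params())    # 8896 = 3*3*64 + 64*128 + 128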

Conv*DTranspose (Deconvolution)

"Undoes" a convolutional layer. Is generally used to increase the dimensionality (rows and columns) while decreasing the channel number.

tf.keras.layers.Conv2DTranspose(
    filters,
    kernel_size,
    strides=(1, 1),
    padding="valid",
    output_padding=None,
    data_format=None,
    dilation_rate=(1, 1),
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
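
A small sketch showing how a transposed convolution with stride 2 roughly doubles the spatial dimensions (shapes are arbitrary):

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 7, 7, 64))
up = keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same")(x)
print(up.shape)   # (1, 14, 14, 32) -- rows/cols doubled, channels reduced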

Core Layers

Input

Used to instantiate a Keras tensor.

tf.keras.Input(
    shape=None,
    batch_size=None,
    name=None,
    dtype=None,
    sparse=False,
    tensor=None,
    ragged=False,
    **kwargs
)
  • Shape: Input shape, not including batch size. Should be a tuple of integers.
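
A minimal functional-API sketch; Input creates the symbolic tensor the rest of the model is built on (the shapes here are arbitrary):

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(784,))                 # batch size is left out
outputs = keras.layers.Dense(10, activation="softmax")(inputs)
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()
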
Dense

The most common layer type. A fully connected layer: every neuron is connected to every output of the previous layer.

tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
  • Units: The number of neurons
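
A small sketch: a Dense layer with 32 units maps an input of any size n to an output of size 32 (its kernel has shape (n, 32)); the sizes below are arbitrary:

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(32)
y = layer(tf.random.normal((4, 16)))     # 4 samples with 16 features each
print(y.shape)                           # (4, 32)
print(layer.kernel.shape)                # (16, 32) -- weights are created on the first call
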
Activation

Add an activation function to the previous layer.

tf.keras.layers.Activation(activation, **kwargs)
Embedding

THIS site does a great job of explaining what an embedding layer does: "an embedding layer tries to find the optimal mapping of each of the unique words to a vector of real numbers. The size of that vector is equal to the output_dim". An embedding layer maps a vector of word indices (a small sample of the vocabulary) to feature vectors.

Must be the first layer of a model.

tf.keras.layers.Embedding(
    input_dim,
    output_dim,
    embeddings_initializer="uniform",
    embeddings_regularizer=None,
    activity_regularizer=None,
    embeddings_constraint=None,
    mask_zero=False,
    input_length=None,
    **kwargs
)
  • Input Dim: Vocabulary size, number of possible unique words in an input vector.
  • Output Dim: Dimension of the dense embedding (size of the feature vector for each unique word)
  • Input Length: Use if the input is of a constant length. Required if using Flatten followed by Dense later on.
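
A minimal sketch, assuming a vocabulary of 1000 words, sequences of length 10, and 64-dimensional embeddings (all arbitrary choices):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Embedding(input_dim=1000, output_dim=64, input_length=10),
    keras.layers.Flatten(),              # requires input_length to be set
    keras.layers.Dense(1, activation="sigmoid"),
])
x = tf.random.uniform((32, 10), maxval=1000, dtype=tf.int32)   # 32 sequences of word ids
print(model(x).shape)                    # (32, 1)
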
Masking

Used primarily with RNNs. Skips timesteps whose features all equal mask_value, which is useful for skipping padding when using an LSTM.

tf.keras.layers.Masking(mask_value=0.0, **kwargs)
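
A small sketch: timesteps whose features all equal mask_value are skipped by the LSTM (the shapes are arbitrary):

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Masking(mask_value=0.0, input_shape=(5, 8)),   # 5 timesteps, 8 features
    keras.layers.LSTM(16),               # all-zero (padded) timesteps are ignored
])
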
Lambda
tf.keras.layers.Lambda(
    function, output_shape=None, mask=None, arguments=None, **kwargs
)
  • Function: The function to wrap; it receives the input tensor as its first argument.
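
A minimal sketch wrapping an arbitrary function in a layer:

import tensorflow as tf
from tensorflow import keras

scale = keras.layers.Lambda(lambda x: x * 2.0)      # any simple, stateless tensor function
print(scale(tf.constant([1.0, 2.0, 3.0])).numpy())  # [2. 4. 6.]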

Locally Connected Layers

Merging Layers

Normalization Layers

Pooling Layers

Pooling layers are used to downsample. They are generally used with convolutional layers to reduce the size of the feature space

MaxPooling*D

Max pooling passes the max value over a window to the next layer. There are three different pooling layer dimensions:

  • MaxPooling1D: Ex: temporal data
  • MaxPooling2D: Ex: spatial data (images)
  • MaxPooling3D: Ex: 3D data (spatial or spatio-temporal)
tf.keras.layers.MaxPooling2D(
    pool_size=(2, 2), 
    strides=None, 
    padding="valid", 
    data_format=None, 
    **kwargs
)
  • Pool Size: Size of the window
  • Strides: How far the window moves after each pooling step (int or tuple of ints)
  • Padding:
    • 'valid': No padding. output_shape = floor((input_shape - pool_size) / strides) + 1
    • 'same': Pads so that output_shape = ceil(input_shape / strides) (the same height/width as the input when strides = 1)
  • Data Format: 'channels_last' or 'channels_first'
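
A small sketch of the downsampling with the default 2x2 window (the input shape is arbitrary):

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 28, 28, 8))
pooled = keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
print(pooled.shape)    # (1, 14, 14, 8) -- rows/cols halved, channels unchanged
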
AveragePooling*D

Average pooling passes the average value over a window to the next layer. Like max pooling, it comes in 1D, 2D, and 3D versions:

tf.keras.layers.AveragePooling2D(
    pool_size=(2, 2), strides=None, padding="valid", data_format=None, **kwargs
)

Args same as MaxPooling

Other

There are also GlobalMaxPooling and GlobalAveragePooling variants that pool over the entire input rather than a window.
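
A small sketch: global pooling collapses each feature map to a single value, a common way to go from convolutional feature maps to a Dense classifier (the input shape is arbitrary):

import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((1, 7, 7, 256))
print(keras.layers.GlobalAveragePooling2D()(x).shape)   # (1, 256)
print(keras.layers.GlobalMaxPooling2D()(x).shape)       # (1, 256)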

Preprocessing Layers

Recurrent Layers

Reshaping Layers

Weight Constraints

Constraints can be added to the weights of a layer. For example, a constraint might disallow negative weights, or it might limit the norm of the layer's weights.
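
A minimal sketch using two of the built-in constraints (the values are arbitrary):

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(
    64,
    kernel_constraint=keras.constraints.MaxNorm(max_value=2.0),   # limit the norm of the weights
    bias_constraint=keras.constraints.NonNeg(),                   # disallow negative biases
)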

Weight Initializers

Weight initializers set the initial values of a layer's weights.
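
A minimal sketch; glorot_uniform is the default for Dense layers, and the initializers used here are also built in:

import tensorflow as tf
from tensorflow import keras

layer = keras.layers.Dense(
    64,
    kernel_initializer=keras.initializers.HeNormal(),     # often paired with ReLU-family activations
    bias_initializer=keras.initializers.Zeros(),
)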

Weight Regularizers

Weight regularizers penalize certain aspects of a layer's parameters during optimization (training).

Three common regularizer arguments exist for most layer types:

  • kernel_regularizer: Applies regularization function to the weights matrix
  • bias_regularizer: Applies regularization function to the bias
  • activity_regularizer: Applies regularization function to the output of the layer

Good stackexchange post about the three regularizers.

There are three available regularizers:

  • tf.keras.regularizers.l1(l1=0.01): loss = l1 * reduce_sum(abs(x))
  • tf.keras.regularizers.l2(l2=0.01): loss = l2 * reduce_sum(square(x))
  • tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01)

For example:

import tensorflow as tf
from tensorflow.keras import regularizers

layer = tf.keras.layers.Dense(
    units=64,
    kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
    bias_regularizer=regularizers.l2(1e-4),
    activity_regularizer=regularizers.l2(1e-5)
)