Study of U_CARFnet and Tensorflow

Deep Neural Network Concepts

Evaluation

High Recall Regime

The high recall regime refers to a training strategy where the model is optimized to capture as many of the relevant positive samples as possible, even if that means including some incorrect predictions (false positives).

Recall is a metric that quantifies how well the model finds all the actual positives:

Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

Key points of high recall regime:

  • Focus on completeness: try to find all the true positives, minimizing false negatives (missed relevant samples).
  • Tolerance for errors: the model accepts a higher rate of false positives to achieve better recall.
  • Use cases: common in situations where missing a positive is more costly than incorrectly including a negative (e.g., disease diagnosis, fraud detection).

F-score

The F-score, also known as the F-measure, is a common metric for evaluating model performance, especially in classification problems. It combines the model’s precision and recall into a single, balanced score.
In practice, there’s often a trade-off between precision and recall. Increasing precision might decrease recall, and vice versa. The F-score uses a harmonic mean to combine both metrics, providing a better overall assessment of the model’s capabilities.

\text{F-score} = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}

Where:

  • \beta is a non-negative parameter that adjusts the weights of precision and recall:
    • when \beta = 1, the formula simplifies to the F1-score, where precision and recall are given equal weight;
    • when \beta > 1, recall is given more weight, and the model will be more inclined to improve recall;
    • when \beta < 1, precision is given more weight, and the model will be more inclined to improve precision.
  • Precision = \frac{TP}{TP+FP} (out of all the positive predictions, how many were actually correct?)
  • Recall = \frac{TP}{TP+FN} (out of all the actual positives, how many were correctly predicted?)
  • TP = True Positives
  • FP = False Positives
  • FN = False Negatives
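As a quick worked example, a small Python function computes the F-score directly from confusion-matrix counts (the counts below are made up for illustration):

def f_score(tp, fp, fn, beta=1.0):
    # F-beta score from raw confusion-matrix counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 80 TP, 20 FP, 10 FN
print(f_score(80, 20, 10, beta=1.0))  # F1 ≈ 0.842
print(f_score(80, 20, 10, beta=2.0))  # F2 ≈ 0.870, recall weighted more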

F1-score
F1-score is a special case of the F-score where \beta = 1.

Normalization

In machine learning, normalization is a technique to rescale data or features into a fixed range, usually [0,1].

Min-Max Normalization is a commonly used method for normalization.

  • Rescale data into [0, 1]:

    x' = \frac{x - x_{min}}{x_{max} - x_{min}}

    This is the default choice, commonly used in deep learning and feature scaling.

  • Rescale data into [-1, 1]:

    x' = 2 \cdot \frac{x - x_{min}}{x_{max} - x_{min}} - 1

    Often used in neural networks (e.g., with the tanh activation function) because it keeps data centered around 0, improving convergence speed.

Z-score Normalization (Standardization)

Standardization is similar to normalization, but instead of scaling to a fixed range, it transforms data to have a mean (\mu) of 0 and a standard deviation (\sigma) of 1 (i.e., a standard normal distribution).
Both normalization and standardization help adjust the data distribution to improve numerical stability and prevent issues like exploding or vanishing gradients in deep learning models.

x' = \frac{x - \mu}{\sigma}

Scaling to Unit Length

Another normalization approach is scaling input vectors individually to unit norm (vector length), where each feature value is divided by the Euclidean length of the vector.
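A minimal NumPy sketch of the three rescaling methods above, using a made-up feature vector:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max to [0, 1]
x01 = (x - x.min()) / (x.max() - x.min())

# Min-max to [-1, 1]
x11 = 2 * (x - x.min()) / (x.max() - x.min()) - 1

# Z-score standardization: mean 0, standard deviation 1
xz = (x - x.mean()) / x.std()

# Scaling to unit length: divide by the Euclidean (L2) norm
xu = x / np.linalg.norm(x)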

Overfitting

Increasing training dataset

A good rule of thumb is that the total number of training data points should be at least 2 to 3 times larger than the number of parameters in the neural network, although the precise number of data instances depends on the specific model at hand.
A rule of thumb that has been around for almost as long as the MLP itself is that you should use a number of training examples that is at least 10 times the number of weights.

Early Stopping

Keep monitoring the accuracy on the validation (or training) dataset; once the accuracy has saturated, stop training.
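In TensorFlow this can be done with the built-in EarlyStopping callback; a minimal sketch (the monitored metric and the patience value are choices, not fixed rules):

import tensorflow as tf

# Stop when validation accuracy has not improved for 5 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=5,
    restore_best_weights=True  # roll back to the weights of the best epoch
)
# model.fit(..., validation_data=val_dataset, callbacks=[early_stop])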

Regularization

Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function. It helps the model generalize better to unseen data by discouraging excessive complexity.
L1 Regularization (Lasso Regression)
Adds the sum of absolute values of the weights to the cost/loss function:

E = E_0 + \frac{\lambda}{2N} \sum_{w} |w| \qquad (9)

  • E_0 is the un-regularized cost/error function.
  • Encourages sparsity, meaning some weights become exactly zero, leading to feature selection.

L2 Regularization
Adds the sum of squared values of the weights to the cost/loss function:

E = E_0 + \frac{\lambda}{2N} \sum_{w} w^2 \qquad (10)

  • E_0 is the un-regularized cost/error function.
  • Penalizes large weights, leading to smaller but nonzero coefficients, helping with stability.

Elastic Net Regularization
A combination of L1 and L2 regularization, balancing sparsity and stability:

Loss = Loss_{original} + \lambda_1 \sum |w| + \lambda_2 \sum w^2
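In Keras, these penalties can be attached per layer through kernel_regularizer; a minimal sketch (the layer sizes and λ values are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

dense_l1 = layers.Dense(64, kernel_regularizer=regularizers.l1(1e-4))                 # L1 penalty
dense_l2 = layers.Dense(64, kernel_regularizer=regularizers.l2(1e-4))                 # L2 penalty
dense_en = layers.Dense(64, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4))  # elastic net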

Dropout
Dropout is a radically different technique for regularization. Unlike L1 and L2 regularization, dropout doesn't rely on modifying the cost function. Instead, dropout modifies the network itself.
With dropout, the training process is modified: we start by randomly (and temporarily) deleting half of the hidden neurons in the network, while leaving the input and output neurons untouched.
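In Keras, dropout is a layer inserted between the layers it should regularize; it is active only during training and becomes an identity at inference time. A minimal sketch with arbitrary layer sizes:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),  # randomly zero 50% of the hidden activations during training
    layers.Dense(10, activation="softmax")
])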

Local Minima

Momentum

Momentum is an optimization technique that helps accelerate gradient descent and escape local minima by adding an exponentially weighted moving average of past gradients.

w_t \leftarrow w_{t-1} - \eta \nabla E(w_{t-1}) + \alpha \Delta w_{t-1}

  • w_t: the weight at time step t
  • w_{t-1}: the weight at the previous time step
  • \eta: the learning rate
  • \nabla E(w_{t-1}): the gradient of the error with respect to the weights
  • \alpha: the momentum constant, 0 < \alpha < 1, usually 0.9
  • \Delta w_{t-1}: the update applied at the previous time step, \Delta w_{t-1} = w_{t-1} - w_{t-2}
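In TensorFlow, momentum is simply a parameter of the SGD optimizer:

import tensorflow as tf

# SGD with the usual momentum constant of 0.9
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)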

Sequential Training

Batch gradient descent

All of the training examples are presented to the neural network, the average sum-of-squares error is computed, and this is used to update the weights. Thus there is only one set of weight updates per epoch (one pass through all the training examples).
In sequential (online) training, by contrast, the errors are computed and the weights updated after each input. This is not guaranteed to be as efficient in learning, but it is simpler to program using loops, and it is therefore much more common.

Stochastic Gradient Descent (SGD)

  • An optimization algorithm that updates model parameters using one random sample at a time.
  • Faster but has higher variance in updates.
  • Helps escape local minima due to its randomness.

Mini-Batch Gradient Descent

Mini-batch is a training method that balances Stochastic Gradient Descent (SGD) and Batch Gradient Descent, as sketched after this list:

  • Instead of using one sample (SGD) or the entire dataset (Batch GD), mini-batch splits the data into small groups (mini-batches).
  • The model updates its weights after processing each mini-batch.
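In Keras, the batch_size argument of model.fit selects among these regimes; a sketch, assuming model, x_train, and y_train are already defined:

# batch_size = 1            -> stochastic gradient descent
# batch_size = len(x_train) -> batch gradient descent
# anything in between       -> mini-batch gradient descent
model.fit(x_train, y_train, batch_size=32, epochs=10)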

Local and Global Relationships

| Feature | Local Relationships | Global Relationships |
| --- | --- | --- |
| Scope | Close, neighboring elements | Distant, non-adjacent elements |
| CNN Capture | Convolutional kernels | Larger kernels, pooling, or other techniques |
| Importance | Basic feature extraction | High-level understanding, context |

Attention Block

The attention mechanism is a fundamental concept in modern neural networks, particularly in natural language processing (NLP), computer vision, and other sequence-based tasks. It allows a model to focus on the most relevant parts of the input data when making predictions, mimicking the way humans pay attention to specific details while ignoring less important information.

What is an Attention Block?
An attention block is a module within a neural network that computes a weighted sum of input features, where the weights are dynamically determined based on the relevance of each feature to the task at hand. These weights are often referred to as attention scores. The attention block enables the model to:

  1. Focus on the most important parts of the input
  2. Handle long-range dependencies in sequential data (e.g., in text or time series).
  3. Improve the interpretability of the model by revealing which parts of the input are being prioritized.


SE - Squeeze and Excitation Block

Focus: Channel attention. It learns to weight the importance of different channels in a feature map.
Mechanism:

  1. Squeeze: Global average pooling reduces spatial dimensions, creating a channel-wise descriptor.
  2. Excitation: Two fully connected layers learn channel-wise scaling factors.
  3. Scaling: The original feature map is multiplied by the learned scaling factors.
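A minimal Keras sketch of an SE block following these three steps (the reduction ratio of 16 is the common choice from the SENet paper):

import tensorflow as tf
from tensorflow.keras import layers

def se_block(inputs, reduction=16):
    channels = inputs.shape[-1]
    # Squeeze: global average pooling -> one descriptor per channel
    x = layers.GlobalAveragePooling2D()(inputs)
    # Excitation: two fully connected layers produce per-channel scales
    x = layers.Dense(channels // reduction, activation="relu")(x)
    x = layers.Dense(channels, activation="sigmoid")(x)
    # Scaling: broadcast-multiply the scales over the spatial dimensions
    x = layers.Reshape((1, 1, channels))(x)
    return inputs * x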

ECA - Efficient Channel Attention Block

Focus: Channel attention, with improved efficiency.
Mechanism:

  1. Global Average Pooling: Similar to SE block.
  2. 1D Convolution: Instead of fully connected layers, it uses a 1D convolution with adaptive kernel size to capture local cross-channel interactions.
  3. Scaling: Same as SE block.
Key Improvement: more efficient than the SE block due to the 1D convolution.
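A minimal Keras sketch of an ECA block (a fixed kernel size of 3 is used here for simplicity; the ECA paper derives it adaptively from the channel count):

import tensorflow as tf
from tensorflow.keras import layers

def eca_block(inputs, kernel_size=3):
    channels = inputs.shape[-1]
    # Global average pooling, then treat the channels as a 1D sequence
    x = layers.GlobalAveragePooling2D()(inputs)
    x = layers.Reshape((channels, 1))(x)
    # 1D convolution captures local cross-channel interactions
    x = layers.Conv1D(1, kernel_size, padding="same", use_bias=False)(x)
    x = layers.Activation("sigmoid")(x)
    x = layers.Reshape((1, 1, channels))(x)
    return inputs * x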

CBAM - Convolutional Block Attention Module

Focus: Both channel and spatial attention.
Mechanism:

  1. Channel Attention: Similar to SE block, but uses both global average pooling and global max pooling.
  2. Spatial Attention: Learns to weight the importance of different spatial locations in the feature map using convolutional layers.
  3. Combination: The channel and spatial attention maps are multiplied with the input feature map.
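A minimal sketch of CBAM's spatial-attention half (the channel half can reuse an SE-style block with both average and max pooling; the 7x7 kernel follows the CBAM paper):

import tensorflow as tf
from tensorflow.keras import layers

def cbam_spatial_attention(inputs, kernel_size=7):
    # Pool across the channel axis to get average and max descriptors per location
    avg_pool = tf.reduce_mean(inputs, axis=-1, keepdims=True)
    max_pool = tf.reduce_max(inputs, axis=-1, keepdims=True)
    concat = tf.concat([avg_pool, max_pool], axis=-1)  # (batch, h, w, 2)
    # A convolution turns the two descriptors into a per-location attention map
    attn = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(concat)
    return inputs * attn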

Loss Functions

Cross-Entropy

The cross-entropy loss function, also known as log loss or binary cross-entropy (in the binary case), is a widely used loss function in machine learning, especially for classification tasks. It quantifies the difference between two probability distributions: the predicted probability distribution generated by a model and the true (or target) probability distribution.

Purpose:

  • Measures the difference between predicted and true probabilities: Cross-entropy loss penalizes models that produce predictions that are far from the actual labels.
  • Drives learning in classification: By minimizing the cross-entropy loss during training, the model learns to output probability distributions that are closer to the true distributions of the classes.
  • Well-behaved gradient: The cross-entropy loss function has a nice gradient that makes it suitable for optimization algorithms like gradient descent. This means the model can effectively learn from its errors.

Mathematical Formulation:

  1. Binary Cross-Entropy (for two classes):
    Let:
    • N be the number of samples.
    • y_i be the true label for the i-th sample (either 0 or 1).
    • p_i be the predicted probability of the i-th sample belonging to class 1 (so 1 - p_i is the predicted probability of belonging to class 0).

The binary cross-entropy loss L is calculated as:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

Explanation:

  • If the true label y_i is 1, the second term (1 - y_i) \log(1 - p_i) becomes zero, and the loss depends on -\log(p_i). If the prediction p_i is close to 1 (correct), the loss is close to 0. If p_i is close to 0 (incorrect), the loss becomes very large.
  • If the true label y_i is 0, the first term y_i \log(p_i) becomes zero, and the loss depends on -\log(1 - p_i). If the prediction p_i is close to 0 (correct, meaning 1 - p_i is close to 1), the loss is close to 0. If p_i is close to 1 (incorrect, meaning 1 - p_i is close to 0), the loss becomes very large.
  • The negative sign ensures that the loss is a positive value.
  • The \frac{1}{N} factor averages the loss over all samples.
  2. Categorical Cross-Entropy (for multiple classes):

Let:

  • N be the number of samples.
  • C be the number of classes.
  • y_{ic} be a binary indicator (0 or 1) that is 1 if the i-th sample belongs to class c, and 0 otherwise (one-hot encoded true label).
  • p_{ic} be the predicted probability of the i-th sample belonging to class c.

The categorical cross-entropy loss L is calculated as:
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(p_{ic})

Explanation:

  • For each sample i, we iterate through all classes c.
  • The term y_{ic} \log(p_{ic}) is only non-zero when the true label y_{ic} is 1 (i.e., the sample belongs to class c). In this case, the loss depends on -\log(p_{ic}).
  • Similar to the binary case, if the predicted probability p_{ic} for the correct class is high, the loss is close to 0. If it's low, the loss is high.
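A quick numeric check of the formula with made-up predictions for one sample and three classes:

import numpy as np

y_true = np.array([[0.0, 1.0, 0.0]])  # one-hot: true class is index 1
y_pred = np.array([[0.1, 0.7, 0.2]])  # predicted class probabilities

# L = -(1/N) * sum_i sum_c y_ic * log(p_ic)
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(loss)  # -log(0.7) ≈ 0.357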

Focal loss

The focal loss function is a cross-entropy loss with a weighting factor alpha to balance the uneven proportion of positive and negative samples, plus a modulating factor to address the imbalance between easy and hard samples. U-CARFnet adopted the focal loss function to optimize training performance under severe class imbalance.
The binary cross entropy loss function, binary_crossentropy:

CE(p, y) = -y \log(p) - (1-y) \log(1-p) = \begin{cases} -\log(p) & y = 1 \\ -\log(1-p) & \text{otherwise} \end{cases}

p_t = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{otherwise} \end{cases}

CE(p, y) = CE(p_t) = -\log(p_t)

  • p is the model's predicted probability
  • y is the actual label

We can observe from the equations above that, with the standard cross-entropy loss, the loss decreases as the model's predicted probability increases for positive samples, and decreases as the predicted probability decreases for negative samples.
However, during training, the loss function tends to update slowly for many easy samples (samples that the model already classifies correctly). This can make the optimization process less efficient, potentially preventing the model from reaching its best performance.

The focal loss function introduced in U-CARFnet can solve these problems.

\text{CE}(p_t) = -\alpha \log(p_t)
\text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)
\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)

where \text{CE}(p_t) = -\alpha \log(p_t) addresses the class imbalance, while \text{FL}(p_t) = -(1-p_t)^\gamma \log(p_t) addresses the imbalance between easy and hard samples.
The two separate improvements are combined to form the focal loss function \text{FL}(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t). This solves the imbalance between positive and negative samples and between easy and hard samples.
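A minimal TensorFlow sketch of the binary focal loss in the form FL(p_t) above (the usual γ = 2, α = 0.25 defaults are assumptions, not values from the U-CARFnet text):

import tensorflow as tf

def focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t: predicted probability of the true class; alpha_t: class-balance weight
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss

# model.compile(optimizer="adam", loss=focal_loss(gamma=2.0, alpha=0.25))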

Dice loss

U-CARFnet

Development

U-CARFnet introduced three developments.

1. Attention Mechanism

SENet, CBAM and ECANet.

2. Residual Learning Module

U-CARFnet applied residual learning modules, which were proposed in the ResNet model, to solve the information loss problems that occur during transmission in traditional convolutional or fully connected networks. This structure can accelerate the training of neural networks and greatly enhance the model's accuracy.
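A minimal Keras sketch of a basic residual block with an identity skip connection (this is the generic ResNet pattern, not the exact U-CARFnet module):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(inputs, filters):
    x = layers.Conv2D(filters, 3, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    # Project the shortcut with a 1x1 convolution if the channel counts differ
    shortcut = inputs
    if inputs.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(inputs)
    x = layers.Add()([shortcut, x])  # skip connection
    return layers.Activation("relu")(x)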

3. Focal Loss Function

Model

The figure above shows the residual learning module and the CEIB learning module, modified from the deformable convolutional block integrated with the attention mechanism in the U-CARFnet model. In general, the deformable convolutional block can locally perceive each feature in the image using the convolution layer. The local features are integrated at a high level to extract the global feature information. The learning ability of the skip connection in the residual learning module is then utilized to transmit the low-level and high-level information to the next stage. The features are extracted using the next group of convolution operations. Finally, the attention learning module is introduced to optimize the information learning and extraction ability and to solve the problem of information error and loss during learning and transmission.

HED

Deep Supervision and Weighted Fusion Supervision

  1. Deep Supervision
    Deep supervision refers to the technique of introducing additional supervision signals at multiple layers of a neural network (typically intermediate layers), rather than applying supervision only at the final output layer. Its purposes are mitigating gradient vanishing, enhancing feature learning, and capturing multi-scale features.
    HED introduces side-output layers at multiple convolutional layers of VGGNet (e.g., conv1_2, conv2_2, conv3_3, conv4_3, conv5_3).
    Each side-output layer is associated with a supervision signal (e.g., cross-entropy loss) to train the predictions at that layer.
  2. Weighted Fusion Supervision
    Weighted fusion supervision refers to the process of combining predictions from multiple layers (e.g., side-outputs from deep supervision) through a weighted fusion mechanism to produce the final output, for the purposes of integrating multi-scale information and optimizing the final output. A minimal fusion sketch follows this list.
    HED fuses the predictions from multiple side-output layers using learned or manually set weights.
    The fused result is used as the final edge detection map.
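A minimal sketch of the weighted fusion step, assuming the side outputs have already been upsampled to a common resolution; the 1x1 convolution holds one learnable fusion weight per side output:

import tensorflow as tf
from tensorflow.keras import layers

def fuse_side_outputs(side_outputs):
    # side_outputs: list of (batch, h, w, 1) edge maps from the side-output layers
    concat = layers.Concatenate(axis=-1)(side_outputs)
    # One weight per side output, learned during training
    return layers.Conv2D(1, 1, activation="sigmoid", use_bias=False)(concat)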

Consensus Sampling

Consensus sampling is a technique that combines predictions or decisions from multiple models or data sources to produce a more reliable and robust output by selecting results that achieve majority agreement.
The core idea is leveraging the principle that agreement among independent models or data sources reduces the impact of individual errors or noise.
Key Methods:

  • Majority Voting: for classification tasks, the most frequently predicted class is selected (see the sketch after this list).
  • Weighted Averaging: for regression tasks, predictions are combined using weighted averages, where weights reflect model confidence or performance.
  • Threshold-Based Selection: results are accepted only if they meet a predefined agreement threshold (e.g., 80% of models agree).
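A minimal NumPy sketch of majority voting over several models' class predictions (the prediction matrix is made up):

import numpy as np

def majority_vote(predictions):
    # predictions: (num_models, num_samples) array of predicted class labels
    # For each sample (column), pick the most frequently predicted class
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, predictions)

preds = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [1, 1, 2]])  # three models, three samples
print(majority_vote(preds))    # [0 1 2]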

Tensorflow

Layers

Convolutional

Padding

  • valid
    No padding is added to the input. The output feature map is smaller than the input. The output size decreases based on the kernel size and stride.

  • same
    The output size stays the same as the input, with padding applied to the borders of the input.
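A quick shape check of both padding modes with a 3x3 kernel and stride 1:

import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))
print(tf.keras.layers.Conv2D(8, 3, padding="valid")(x).shape)  # (1, 30, 30, 8)
print(tf.keras.layers.Conv2D(8, 3, padding="same")(x).shape)   # (1, 32, 32, 8)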

BatchNormalization

BatchNormalization is used to improve the stability, speed, and performance of neural networks during training. It works by normalizing the input of each layer to have a mean of zero and a standard deviation of one. This is done by adjusting and scaling the activations during training.

  1. Normalization: For each mini-batch during training, BatchNormalization normalizes the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. This helps in reducing internal covariate shift, which is the change in the distribution of network activations due to the change in network parameters during training.
  2. Scaling and shifting: After normalization, the data is scaled and shifted using two learnable parameters, gamma and beta. These parameters allow the network to undo the normalization if it learns that the original distribution was better for the task.

    The Factory Analogy for Batch Normalization
    Imagine a neural network as a factory with multiple workshops (layers) in a production line. Each workshop takes the output from the previous workshop, performs some processing, and passes it on to the next one.
    Internal Covariate Shift: this is like the raw materials (data) coming into each workshop varying wildly in quality and consistency. One day it's all large, high-quality pieces; the next day it's small, low-quality scraps. Each workshop has to constantly adjust its machines and processes to handle this changing input, making production slow and inefficient.
    Batch Normalization: this is like installing a quality control and standardization unit at the entrance of each workshop. This unit takes the incoming raw materials (data from the previous layer), measures their average size and quality, and then processes them to ensure they are all of a consistent, standard size and quality.
    By standardizing the input to each workshop (layer), Batch Normalization ensures that each workshop receives a consistent flow of data. This means the workshops don't have to constantly readjust their processes, making them more efficient.
    Batch Normalization acts like a buffer that smooths out the variations in data flow between layers, allowing each layer to focus on learning the core patterns without being disrupted by changes in the overall data distribution.

The key components of batch normalization are:

  1. Mini-batch statistics:
    For a given feature in a mini-batch, we calculate:
  • Mean (\mu_B) of that feature across all samples in the mini-batch:

    \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i

    where m is the number of samples in the mini-batch, and x_i is the value of the feature for sample i.

  • Variance (\sigma_B^2) of that feature:

    \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

    This tells us how spread out the values are for the feature.

  2. Normalization:
    The calculated mean and variance are used to normalize each feature:

    \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

  • x_i is the original value for sample i
  • \mu_B is the mean of the feature in the mini-batch
  • \sigma_B^2 is the variance of the feature in the mini-batch
  • \epsilon is a small constant (like 10^{-5}) added to avoid division by zero. This process makes sure that the feature values are centered around 0 with a standard deviation of 1.
  3. Scale and shift:
    After normalizing, we scale and shift the values using learnable parameters \gamma and \beta to allow the model to adjust (a NumPy sketch of all three steps follows this list):

    y_i = \gamma \hat{x}_i + \beta

  • \hat{x}_i is the normalized value
  • \gamma is the scaling factor (learned by the model)
  • \beta is the shifting factor (also learned by the model)
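The three steps in a minimal NumPy sketch (training-time statistics only; a real layer also tracks moving averages for use at inference):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature over the mini-batch
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift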

GlobalAveragePooling2D

The GlobalAveragePooling2D layer takes a 4D tensor as input (typically the output of a convolutional layer) and performs global average pooling across the spatial dimensions (height and width).

How it works:

  • It calculates the average value for each channel across the entire spatial dimensions (height and width) of the input tensor.
  • It outputs a 2D tensor where each element represents the average value for a specific channel.

Key role:

  • Reduces dimensionality: it significantly reduces the number of parameters in the model by replacing spatial dimensions with channel-wise averages.
  • Feature aggregation: it aggregates spatial information into a compact representation for each channel.
  • Regularization: it can help prevent overfitting by reducing the number of parameters.
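A quick shape check:

import tensorflow as tf

x = tf.random.normal((2, 7, 7, 512))             # (batch, height, width, channels)
y = tf.keras.layers.GlobalAveragePooling2D()(x)  # average over height and width
print(y.shape)                                   # (2, 512)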

Dense

Dense Layers are the most basic and widely used type of layer in neural networks. In a dense layer, every neuron is connected to every neuron in the previous layer.
How they work:

  • Linear Transformation: the input is multiplied by a weight matrix and then a bias is added.
  • Activation Function: a non-linear function (like ReLU, sigmoid, etc.) is applied element-wise to the result. This introduces non-linearity, which is essential for learning complex patterns.

Key role:

  • Feature Extraction: dense layers learn to combine features from the previous layer in complex ways.
  • Classification/Regression: often used as the final layer(s) to make predictions.
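A quick sketch with arbitrary sizes (64 input features, 10 output units):

import tensorflow as tf

# y = activation(x @ W + b)
layer = tf.keras.layers.Dense(10, activation="softmax")
x = tf.random.normal((32, 64))
y = layer(x)
print(y.shape)  # (32, 10)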

Upsampling

In-Network Bilinear Interpolation

In-Network bilinear interpolation is a fixed upsampling method used to increase the resolution of feature maps by interpolating values based on weighted averages of neighboring pixels.
The interpolation weights are predefined and do not change during training. It is computationally lightweight, making it suitable for real-time applications.
This upsampling method is used in semantic segmentation, e.g., in FCN and U-Net.

import tensorflow as tf

def in_network_bilinear_interpolation(input_tensor, scale_factor):
    """Fixed (non-learnable) bilinear upsampling of a (1, h, w, c) tensor."""
    _, h, w, _ = input_tensor.get_shape().as_list()
    nh, nw = h * scale_factor, w * scale_factor

    # Fractional source coordinates for each output row/column
    row_pos = tf.linspace(0.0, float(h - 1), nh)
    col_pos = tf.linspace(0.0, float(w - 1), nw)

    row_floor = tf.cast(tf.floor(row_pos), tf.int32)
    col_floor = tf.cast(tf.floor(col_pos), tf.int32)
    row_ceil = tf.minimum(row_floor + 1, h - 1)  # ensure we don't go out of bounds
    col_ceil = tf.minimum(col_floor + 1, w - 1)

    # Interpolation weights are the fractional parts of the coordinates
    row_weight = tf.reshape(row_pos - tf.floor(row_pos), (1, nh, 1, 1))
    col_weight = tf.reshape(col_pos - tf.floor(col_pos), (1, 1, nw, 1))

    def gather(rows, cols):
        # Gather the (nh, nw) grid of pixels at the given row/col indices
        rr, cc = tf.meshgrid(rows, cols, indexing="ij")
        idx = tf.stack([rr, cc], axis=-1)                       # (nh, nw, 2)
        return tf.gather_nd(input_tensor[0], idx)[tf.newaxis]   # (1, nh, nw, c)

    top_left = gather(row_floor, col_floor)
    top_right = gather(row_floor, col_ceil)
    bottom_left = gather(row_ceil, col_floor)
    bottom_right = gather(row_ceil, col_ceil)

    # Weighted average of the four neighboring pixels
    return (top_left * (1 - row_weight) * (1 - col_weight) +
            top_right * (1 - row_weight) * col_weight +
            bottom_left * row_weight * (1 - col_weight) +
            bottom_right * row_weight * col_weight)

In-Network Deconvolutional Layer

An in-network deconvolutional layer is a learnable upsampling method, also known as a transposed convolutional layer, that increases the resolution of feature maps by performing the inverse of a convolution operation.
The weights are optimized during training, allowing the layer to adapt to specific tasks. It is capable of learning complex upsampling patterns, making it suitable for high-precision tasks, but it is more computationally expensive due to the learnable weights.
In-network deconvolutional layers are used for image generation in Generative Adversarial Networks and for semantic segmentation (e.g., SegNet, DeepLab). They also form the decoder part of autoencoders.

import tensorflow as tf

def in_network_deconvolution(input_tensor, filters, ks, strides, padding, activation=None):
    # Learnable upsampling: a transposed convolution whose weights are trained
    return tf.keras.layers.Conv2DTranspose(filters, ks, strides=strides,
                                           padding=padding, activation=activation)(input_tensor)

kernel_initializer

When constructing a convolutional layer, we can specify the kernel_initializer.

1. glorot_uniform (Xavier Uniform, the default)

It is designed to maintain a balance between variance of activations and gradients across layers, helping to prevent vanishing or exploding gradients.

W \sim U\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right)

  • n_{in} = number of input units to the layer
  • n_{out} = number of output units
  • U(a, b): uniform distribution over the range [a, b]

This ensures that the variance of activations remains consistent throughout the network.

How to generate a uniformly distributed array or matrix in Python:

import numpy as np

def glorot_uniform(shape, n_in, n_out):
    # Sample weights uniformly from [-limit, limit], limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(low=-limit, high=limit, size=shape)

When to use:

  • works well with sigmoid, tanh, and softmax activations
  • Suitable for both shallow and deep networks, but for ReLU-based networks, HeNormal is often preferred.

2. he_normal

It is designed specifically for deep neural networks using ReLU and its variants (e.g., LeakyReLU, ELU). It helps prevent vanishing or exploding gradients by ensuring that the variance of activations remains stable throughout the layers.

ReLU: Rectified Linear Unit

W \sim N\left(0, \frac{2}{n_{in}}\right)

  • n_{in} is the number of input units to the layer
  • N(\mu, \sigma^2) is a normal distribution with mean \mu and variance \sigma^2
  • The variance \frac{2}{n_{in}} is derived from keeping the variance of activations stable in ReLU networks.

How to generate a random matrix with a normal distribution in Python:

import numpy as np

def he_normal(shape, n_in):
    # Standard deviation sqrt(2 / n_in) keeps activation variance stable under ReLU
    std_dev = np.sqrt(2 / n_in)
    return np.random.normal(loc=0.0, scale=std_dev, size=shape)

Load Dataset

  1. Load a dataset from a folder

    import tensorflow as tf
    train_dataset = tf.keras.utils.image_dataset_from_directory(
        "path_to_data",
        image_size=(224, 224),
        batch_size=32,
        label_mode="categorical"  # "int", "categorical", "binary"
    )
  2. Load a dataset from a NumPy array

    import numpy as np
    import tensorflow as tf

    # Suppose there are 1000 224x224 RGB images
    x_train = np.random.rand(1000, 224, 224, 3)
    y_train = np.random.randint(0, 10, (1000,))  # 10 classes

    train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
  3. Load a dataset from a TFRecord file
    If the data is in TFRecord format, tf.data.TFRecordDataset can be used to parse the dataset (parse_function is a user-defined record parser):

    raw_dataset = tf.data.TFRecordDataset("path_to_data.tfrecord")
    train_dataset = raw_dataset.map(parse_function).batch(32)

Data structure of train_dataset

for images, labels in train_dataset.take(1):
    print(images.shape, labels.shape)

Output:

(32, 224, 224, 3) (32, num_classes)  # 32 RGB images of size 224x224

Data Augmentation

In TensorFlow, using train_dataset.map(data_augmentation) applies data augmentation dynamically during training. This means that every time an image is fed to the model, it is modified in real time based on the data_augmentation function (e.g., random flips, rotation, brightness adjustments). This augmentation happens per epoch, so the images vary each time the model trains on them, but the size of the dataset remains unchanged.

Difference between dynamic and static augmentation:

  1. Dynamic Augmentation (via dataset.map(data_augmentation)):
    • No increase in dataset size: the dataset's size remains the same. Augmented images are generated on-the-fly during training.
    • Memory Efficient: it doesn't require additional storage since the augmented images are generated during training.
    • Flexibility: every epoch sees different augmented versions of the images, improving model robustness.
  2. Static Augmentation (pre-generating more images):
    • Increase in dataset size: augmented images are generated and stored beforehand, increasing the total dataset size.
    • Higher Memory Usage: requires more disk space and memory to store all augmented images.


Processing

  1. Apply image augmentation directly on train_dataset

    import tensorflow as tf

    def augment(image, label):
        # Random horizontal flip and brightness jitter, applied on the fly
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=0.2)
        return image, label

    train_dataset = train_dataset.map(augment)
  2. ImageDataGenerator (deprecated)
    It does not support GPU acceleration and cannot work with tf.data.

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rescale=1./255,
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True
    )
    train_generator = datagen.flow_from_directory(
        "path_to_data",
        target_size=(224, 224),
        batch_size=32,
        class_mode="categorical"
    )

Model

Optimizer

optimizers.Adam()

optimizers.SGD()

optimizers.RMSprop()

optimizers.Adagrad()

optimizers.Adadelta()

optimizers.Ftrl()

Model Training

model.fit()

fit(
    x = None,
    y = None,
    batch_size = None,
    epochs = 1,
    verbose = 'auto',
    callbacks = None,
    validation_split = 0.0,
    validation_data = None,
    shuffle= True,
    class_weight = None,
    sample_weight = None,
    initial_epoch = 0,
    steps_per_epoch = None,
    validation_steps = None, 
    validation_batch_size = None,
    validation_freq = 1
)
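A minimal usage sketch, assuming model, train_dataset, and val_dataset were built as in the earlier sections:

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)
history = model.fit(train_dataset, epochs=10, validation_data=val_dataset)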

Tools

TensorBoard

TensorBoard is TensorFlow's built-in visualization tool for inspecting the model and its training status.

  • Add TensorBoard to the training process:
from tensorflow.keras import callbacks

log_dir = 'logs/fit'
tensorboard_callback = callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
history = model.fit(gen, verbose=1, steps_per_epoch=30, epochs=epochs, callbacks=[tensorboard_callback])
  • Start TensorBoard from the command line:

    tensorboard --logdir logs/fit