Transfer Learning, Fine-Tuning, and Domain Adaptation: A Complete Guide with Anomaly Detection for Heterogeneous Cobots

Last updated: May 27, 2026
Published April 5, 2026 · Updated May 27, 2026 · 44 min read

Summary

What this post covers: A clear separation of transfer learning, fine-tuning, and domain adaptation as a hierarchy of techniques, applied to the concrete problem of building a cross-brand anomaly detection model for heterogeneous collaborative robot fleets with runnable PyTorch examples.

Key insights:

  • Transfer learning is the umbrella paradigm; fine-tuning, domain adaptation, feature extraction, multi-task learning, and few-shot transfer are sibling techniques within it, not synonyms. Getting this hierarchy right prevents most conceptual errors.
  • For heterogeneous cobot fleets, the cheapest effective starting point is per-channel sensor normalization plus fine-tuning only the batch normalization layers. This requires almost no target labels and can be deployed in hours.
  • When BN-only adaptation falls short, escalate to adversarial domain adaptation (DANN), which aligns source and target feature distributions even without target labels, or to supervised contrastive methods when some target labels are available.
  • Inference latency requirements drive architecture choice: a 500K-parameter CNN can run in under 5 ms on Jetson-class hardware, fast enough for collision avoidance, whereas transformer-based models typically need cloud deployment, which makes them unsuitable for real-time safety detection.
  • The hardest part of cross-brand cobot anomaly detection is not the algorithm but data collection and a consistent labeling protocol that domain experts can apply across brands, firmware versions, and operating conditions.

Main topics: Transfer Learning: The Big Picture; Fine-Tuning: Techniques and Strategies; Domain Adaptation: Bridging the Distribution Gap; The Cobot Anomaly Detection Scenario; Practical Implementation Guide; Putting It Together; References.

Consider a Universal Robots UR5e and a FANUC CRX-10iA on the same production line, performing identical pick-and-place operations. Both have six joints, both lift the same payload, and both generate streams of torque, position, and velocity data every millisecond. Yet when an anomaly detection model trained on the UR5e’s data is deployed on the FANUC, the model flags nearly everything as anomalous, even though the task is identical. The sensor noise profiles differ, the control loop frequencies do not match, and the calibration offsets produce entirely different data distributions. The model understands what “normal” looks like for one robot, but is effectively blind to normalcy on another.

This is not a hypothetical problem. As collaborative robots (cobots) proliferate across manufacturing, logistics, and healthcare, organisations increasingly operate heterogeneous fleets that span multiple brands, generations, and firmware versions. Training a separate anomaly detection model for every brand is expensive, slow, and inefficient. The question is whether a model can transfer its understanding of normal robot behaviour across brands.

This is precisely the problem that transfer learning, fine-tuning, and domain adaptation were designed to address. The following sections examine these three concepts, clarify how they relate to one another, and apply them to a concrete scenario: building a cross-brand anomaly detection system for heterogeneous cobots. The treatment provides both theoretical understanding and complete, runnable PyTorch code for several adaptation strategies.

Key Takeaway: Transfer learning is the umbrella paradigm. Fine-tuning and domain adaptation are specific techniques within it. Understanding this hierarchy is essential before proceeding to implementation.

Before proceeding, the conceptual hierarchy that frames the discussion should be made explicit:

Transfer Learning (broad paradigm)
├── Fine-Tuning (retrain pre-trained model on new data)
├── Domain Adaptation (bridge distribution gap between domains)
│   ├── Supervised Domain Adaptation
│   ├── Unsupervised Domain Adaptation (UDA)
│   └── Semi-Supervised Domain Adaptation
├── Feature Extraction (freeze pre-trained layers, train new head)
├── Multi-Task Learning (shared representations)
└── Zero-Shot / Few-Shot Transfer

Transfer learning is the overarching idea: take knowledge learned in one context and apply it in another. Fine-tuning is one mechanism for doing so, in which a pre-trained model is further trained on the target data. Domain adaptation is another mechanism, which specifically addresses the situation in which source and target data come from different distributions. Feature extraction, multi-task learning, and zero- or few-shot transfer are additional strategies under the same umbrella. They are sibling strategies, not synonyms.

With that framework established, each technique is examined in detail below.

[Figure: Transfer learning source-to-target pipeline. A 1D-CNN encoder pre-trained on labelled source data (UR5e cobot) is adapted, by fine-tuning or domain adaptation, to a target domain (FANUC / ABB cobot) with few or no labels. The sibling strategies shown (fine-tuning, domain adaptation, feature extraction, multi-task learning, zero/few-shot) share one goal: reuse knowledge from the source to accelerate learning on the target.]

Transfer Learning: The Big Picture

Formal Definition

Transfer learning is the paradigm of using knowledge acquired from a source task or domain to improve learning on a target task or domain. Formally, given a source domain $D_S$ with a learning task $T_S$, and a target domain $D_T$ with a learning task $T_T$, transfer learning aims to improve the learning of the target predictive function $f_T(\cdot)$ using knowledge from $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$.

Expressed informally: resources have already been spent learning something useful in one context. The objective is to reuse that learning rather than start from scratch.

Why Transfer Learning Matters

The motivation is overwhelmingly practical:

  • Limited labelled data. Labelling anomalies in cobot sensor data requires domain experts familiar with both the robot’s kinematics and the manufacturing process. Thousands of labelled samples may be available for one robot brand, but very few for another.
  • Expensive annotation. Each labelled anomaly may require a robotics engineer to review hours of sensor logs. At 150 USD per hour, labelling 10,000 samples across five brands can cost more than the robots themselves.
  • Faster convergence. A model initialised with transferred knowledge reaches acceptable performance in hours rather than weeks.
  • Better generalisation. Features learned from large, diverse datasets often capture general patterns that improve performance even on seemingly unrelated tasks.

Types of Transfer Learning

The taxonomy breaks down based on what differs between source and target:

| Type | Source Labels | Target Labels | Relationship | Example |
|---|---|---|---|---|
| Inductive Transfer | Available | Available | $T_S \neq T_T$ | ImageNet classification → medical image segmentation |
| Transductive Transfer | Available | Not available | $D_S \neq D_T$, $T_S = T_T$ | UR5e anomaly detection → FANUC anomaly detection (no FANUC labels) |
| Unsupervised Transfer | Not available | Not available | $D_S \neq D_T$ | Self-supervised pre-training on all cobot data → clustering |

For our cobot scenario, transductive transfer is the most relevant: we have labeled anomaly data from one or a few brands (source domains) and want to perform the same anomaly detection task on new brands (target domains) where labels are scarce or nonexistent.

When Transfer Learning Works, and When It Fails

Transfer learning is not a universal solution. It works when source and target share underlying structure. A model trained on ImageNet transfers well to medical imaging because both involve recognising edges, textures, and shapes. A model trained on English text transfers well to French because the two languages share grammatical abstractions.

It fails, sometimes substantially, when source and target are too dissimilar. This is termed negative transfer: the transferred knowledge actively degrades performance on the target task. For example, a model trained on satellite imagery may transfer poorly to microscopy images despite both being images. The spatial scales, textures, and semantic content differ fundamentally.

Caution: Negative transfer is difficult to diagnose because it can resemble an ordinary training problem. If a transferred model performs worse than a randomly initialised one trained on the same target data, negative transfer should be suspected. The remedy is typically to rely less on the transferred knowledge (transfer fewer layers, or freeze fewer of them so that more of the network can re-adapt) or to reconsider whether transfer is appropriate at all.

In the cobot scenario, transfer learning is promising because the robots share the same fundamental kinematic structure. A six-axis articulated arm generates torque profiles that follow similar physical laws regardless of brand. The differences arise in sensor calibration, noise characteristics, and control-system specifics—exactly the kind of distribution shift that domain adaptation was designed to handle.

Historical Context

The modern era of transfer learning began with ImageNet. In 2012, AlexNet demonstrated that deep CNNs could learn powerful visual features. By 2014, researchers had observed that these features, especially those from early layers, transferred remarkably well to other vision tasks. “ImageNet pre-training” became the default starting point for nearly every computer vision project.

NLP followed a similar trajectory. Word2Vec and GloVe provided transferable word embeddings, but the broader transformation came with BERT (2018) and GPT (2018–2019), which showed that pre-training on substantial text corpora created representations that transferred to nearly any language task. Today’s large language models are perhaps the most extensive transfer learning systems: pre-trained on trillions of tokens, then fine-tuned or prompted for specific tasks.

Time-series and industrial AI are now undergoing their own transfer learning shift. Models such as Chronos, TimesFM, and Lag-Llama are emerging as foundation models for temporal data, and domain adaptation for sensor data is an active research area with direct industrial application.

Training From Scratch vs. Transfer Learning

| Factor | From Scratch | Transfer Learning |
|---|---|---|
| Labeled data needed | Large (10k–1M+ samples) | Small (100–1k samples) |
| Training time | Days to weeks | Hours to days |
| Compute cost | High (multi-GPU) | Low to moderate (single GPU) |
| Performance (limited data) | Poor (overfits) | Good to excellent |
| Performance (abundant data) | Excellent (eventually) | Excellent (faster) |
| Domain expertise needed | High (architecture design) | Moderate (strategy selection) |
| Risk of negative transfer | None | Possible if domains too different |

Fine-Tuning: Techniques and Strategies

Fine-tuning is the most widely used transfer learning technique: take a model pre-trained on a source task or domain and continue training it on the target data. The concept is simple, but the practice is nuanced.

Full Fine-Tuning and Partial Fine-Tuning

Full fine-tuning updates all parameters of the pre-trained model. This affords maximum flexibility to adapt, but also presents the highest risk of overfitting, particularly when the target dataset is small. With 50,000 labelled samples in the target domain, full fine-tuning is generally safe. With 500, it is risky.

Partial fine-tuning freezes some layers (typically the earlier ones) and updates only the remainder. The reasoning is that early layers learn generic, transferable features (edge detectors in vision, basic temporal patterns in time-series), while later layers learn task-specific features. Freezing early layers preserves the generic knowledge while adapting the task-specific parts.

Layer-Wise Learning Rate Decay (Discriminative Fine-Tuning)

Rather than imposing a binary freeze/unfreeze decision, discriminative fine-tuning assigns different learning rates to different layers. Earlier layers receive smaller learning rates (they change slowly), while later layers receive larger learning rates (they require more adaptation). A common approach multiplies the learning rate by a decay factor for each layer moving backwards from the output:

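import torch

# Discriminative learning rates in PyTorch
def get_discriminative_params(model, base_lr=1e-3, decay_factor=0.9):
    """Assign decreasing learning rates to earlier layers.

    Parameters are grouped by the module that owns them, so the weight and
    bias of one layer share a learning rate and the decay is applied once
    per layer rather than once per parameter tensor.
    """
    # Group trainable parameters by owning module, in definition order
    layers = {}
    for name, param in model.named_parameters():
        if param.requires_grad:
            layer_name = name.rsplit('.', 1)[0]
            layers.setdefault(layer_name, []).append(param)

    n_layers = len(layers)
    params = []
    for i, (layer_name, layer_params) in enumerate(layers.items()):
        # Earlier layers get smaller LR; the last layer trains at base_lr
        layer_lr = base_lr * (decay_factor ** (n_layers - i - 1))
        params.append({
            'params': layer_params,
            'lr': layer_lr,
            'name': layer_name
        })

    return params

# Usage (assumes `model` is an nn.Module defined elsewhere)
param_groups = get_discriminative_params(model, base_lr=1e-3, decay_factor=0.85)
optimizer = torch.optim.AdamW(param_groups)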

Gradual Unfreezing

Gradual unfreezing begins by training only the final layer (or layers), then progressively unfreezes earlier layers as training proceeds. This prevents early layers from being corrupted by the large gradients that occur at the start of fine-tuning when the loss is high. The strategy was popularised by ULMFiT (Universal Language Model Fine-tuning) and works well for both NLP and time-series tasks.
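The schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ULMFiT recipe itself: the three-block toy model, the helper names `set_trainable` and `apply_gradual_unfreezing`, and the choice of two epochs per stage are all hypothetical.

```python
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def apply_gradual_unfreezing(blocks, epoch, epochs_per_stage=2):
    """Unfreeze blocks from last to first, one more block per stage.

    blocks: list of nn.Module ordered from input to output.
    Returns the number of blocks that are trainable in this epoch.
    """
    n_unfrozen = min(len(blocks), 1 + epoch // epochs_per_stage)
    for i, block in enumerate(blocks):
        set_trainable(block, i >= len(blocks) - n_unfrozen)
    return n_unfrozen


# Hypothetical three-block model, listed in input -> output order
blocks = [nn.Linear(8, 16), nn.Linear(16, 16), nn.Linear(16, 2)]
model = nn.Sequential(blocks[0], nn.ReLU(), blocks[1], nn.ReLU(), blocks[2])

# The optimizer holds all parameters; frozen ones receive no gradient
# and are therefore skipped by the update step.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))  # synthetic batch
history = []
for epoch in range(6):
    history.append(apply_gradual_unfreezing(blocks, epoch))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```

With two epochs per stage, only the output block trains in epochs 0 and 1, the last two blocks in epochs 2 and 3, and the whole network from epoch 4 onwards. In practice the stage length would be tuned on target validation data.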

The Fine-Tuning Decision Matrix

The appropriate fine-tuning strategy depends on two factors: the amount of available target data and the similarity between source and target domains.

| Scenario | Target Data Size | Domain Similarity | Recommended Strategy |
|---|---|---|---|
| A | Small (<1k) | High | Feature extraction only (freeze all, train classifier head) |
| B | Small (<1k) | Low | Fine-tune final layers with aggressive regularization |
| C | Large (>10k) | High | Full fine-tuning with small learning rate |
| D | Large (>10k) | Low | Full fine-tuning or train from scratch |

For cobots that share kinematic structure but differ in brand, the situation falls firmly in the high domain similarity column. When labelled data for the target brand is limited (a common case), Scenario A applies, calling for feature extraction or minimal fine-tuning. When substantial data is available, Scenario C applies, with gentle full fine-tuning.

Regularisation During Fine-Tuning

Fine-tuning on small datasets risks catastrophic forgetting, in which the model loses what it learned during pre-training. Several regularisation techniques help mitigate this risk:

  • L2-SP (L2 penalty toward starting point). Instead of penalising weights toward zero, penalise them toward their pre-trained values. This keeps the model close to the pre-trained solution while allowing adaptation.
  • Dropout. Especially effective when added to fine-tuning layers. Typical values are 0.1 to 0.3 during fine-tuning, compared with 0.5 during training from scratch.
  • Early stopping. Monitor validation loss on the target domain and halt training when it begins to increase. With small target datasets, overfitting can occur within a few epochs.
  • Weight decay. Standard L2 regularisation remains effective, typically at 0.01 to 0.1 during fine-tuning.
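Of the techniques listed, L2-SP is the least standard, so a minimal sketch may help. It assumes a CPU model and a snapshot of the pre-trained weights taken before fine-tuning; the helper name `l2_sp_penalty`, the coefficient `alpha`, and the toy network are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn


def l2_sp_penalty(model: nn.Module, reference: dict, alpha: float = 0.01) -> torch.Tensor:
    """L2-SP: penalise the squared distance from the pre-trained starting point."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        if param.requires_grad and name in reference:
            penalty = penalty + (param - reference[name]).pow(2).sum()
    return 0.5 * alpha * penalty


# Hypothetical stand-in for a pre-trained network
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Snapshot the pre-trained weights once, before fine-tuning starts
reference = {n: p.detach().clone() for n, p in model.named_parameters()}

penalty_at_start = l2_sp_penalty(model, reference)

# Simulate fine-tuning drift: move every weight 0.1 away from its start
with torch.no_grad():
    for p in model.parameters():
        p.add_(0.1)
penalty_after_drift = l2_sp_penalty(model, reference)

# In a training loop the penalty is simply added to the task loss:
#   loss = criterion(model(x), y) + l2_sp_penalty(model, reference)
```

The penalty is zero at the pre-trained solution and grows with the squared drift, which is what distinguishes it from ordinary weight decay (which pulls weights toward zero instead).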

Modern Parameter-Efficient Fine-Tuning

Full fine-tuning updates millions or billions of parameters, which is computationally expensive and requires storing a full copy of the model per task. Parameter-efficient fine-tuning (PEFT) methods address this constraint by updating only a small subset of parameters:

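  • LoRA (Low-Rank Adaptation). Injects trainable low-rank matrices alongside selected weight matrices while the original weights stay frozen. Rather than updating a weight matrix W directly, LoRA parameterises the update as ΔW = BA, where B and A are low-rank matrices. The original LoRA paper reports reductions in trainable parameters of up to roughly 10,000× for GPT-3-scale models with comparable task performance; the saving for a small CNN is far more modest.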
  • QLoRA. Combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of large models on a single consumer GPU.
  • Adapters. Small bottleneck modules inserted between existing layers. Only adapter parameters are trained; the remainder remains frozen.
  • Prefix Tuning and Prompt Tuning. Prepend learnable vectors to the input or hidden states. These approaches originated in NLP but are conceptually applicable to any sequence model.

Tip: For the cobot scenario, LoRA is particularly attractive. A practitioner can maintain a single base anomaly detection model and keep small per-brand LoRA adapters (a few MB each). Switching between brands consists of swapping the adapter weights.
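To make the ΔW = BA idea and the per-brand adapter swap concrete, here is a minimal hand-written sketch for a single linear layer. It does not use any PEFT library; the class name `LoRALinear`, the `adapter_state` helper, and the rank and scaling values are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is
        # exactly zero at initialisation and the base output is unchanged.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        delta = (x @ self.lora_A.t()) @ self.lora_B.t()
        return self.base(x) + self.scale * delta

    def adapter_state(self):
        """Only these tensors need to be stored per brand."""
        return {"lora_A": self.lora_A.detach().clone(),
                "lora_B": self.lora_B.detach().clone()}


base = nn.Linear(128, 64)
layer = LoRALinear(base, rank=4)
x = torch.randn(5, 128)

same_at_init = torch.allclose(layer(x), base(x))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

For this 128-to-64 layer the adapter holds 768 trainable values against 9,024 in total. Swapping brands would amount to loading a different pair of `lora_A` / `lora_B` tensors into the same frozen base.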

Fine-Tuning Code Example

The following is a complete example of fine-tuning a PyTorch model with layer freezing and discriminative learning rates for a time-series anomaly detection task:

import torch
import torch.nn as nn


class CobotAnomalyModel(nn.Module):
    """1D-CNN feature extractor + classifier for cobot anomaly detection."""

    def __init__(self, n_joints=6, n_features_per_joint=4, seq_len=200):
        super().__init__()
        in_channels = n_joints * n_features_per_joint  # 24 input channels

        # Feature extractor (transferable layers)
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1)
        )

        # Classifier head (task-specific)
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 2)  # normal vs anomaly
        )

    def forward(self, x):
        # x shape: (batch, channels, seq_len)
        feat = self.features(x).squeeze(-1)
        return self.classifier(feat)


def fine_tune_for_new_brand(
    pretrained_model,
    target_loader,
    val_loader,
    freeze_features=True,
    base_lr=1e-3,
    n_epochs=30
):
    """Fine-tune a pre-trained cobot model for a new brand."""
    model = pretrained_model

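    if freeze_features:
        # Strategy A: freeze feature extractor, train only classifier
        for param in model.features.parameters():
            param.requires_grad = False
        # Note: model.train() below still updates BatchNorm running statistics
        # inside the frozen extractor. Call model.features.eval() after
        # model.train() if those statistics should stay fixed as well.
        optimizer = torch.optim.Adam(
            model.classifier.parameters(), lr=base_lr
        )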
    else:
        # Strategy C: discriminative learning rates
        param_groups = [
            {'params': model.features.parameters(), 'lr': base_lr * 0.1},
            {'params': model.classifier.parameters(), 'lr': base_lr},
        ]
        optimizer = torch.optim.Adam(param_groups)

    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')
    patience_counter = 0

    for epoch in range(n_epochs):
        model.train()
        for batch_x, batch_y in target_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

        # Validation and early stopping
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                output = model(batch_x)
                val_loss += criterion(output, batch_y).item()

        val_loss /= len(val_loader)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            patience_counter += 1
            if patience_counter >= 5:
                print(f"Early stopping at epoch {epoch}")
                break

    model.load_state_dict(torch.load('best_model.pt'))
    return model

[Figure: Fine-tuning strategy selection matrix, plotting target data size against domain similarity. The four quadrants correspond to Scenarios A to D in the decision matrix above; the cross-brand cobot case sits in Scenario A (small target data, high similarity: freeze the feature extractor and train the classifier head).]

Domain Adaptation: Bridging the Distribution Gap

Whereas fine-tuning assumes that at least some labelled data is available in the target domain, domain adaptation is most often applied to a harder setting: substantial labelled data in the source domain, but no labels at all in the target domain. This setting is unsupervised domain adaptation (UDA), the most common and challenging scenario in real-world deployments.

Formal Definition

In domain adaptation, source and target domains share the same task (for example, anomaly detection) but have different data distributions. Formally: $P_S(X) \neq P_T(X)$, while the labelling function is identical. The objective is to learn a model that performs well on the target distribution despite being trained primarily on the source distribution.

Several types of distribution shift can occur:

  • Covariate shift. P(X) changes while P(Y|X) remains constant. The input distributions differ but the relationship between inputs and outputs is preserved. This is the most common scenario for cobots: sensor data distributions differ across brands, while the definition of “anomaly” remains consistent.
  • Label shift. P(Y) changes while P(X|Y) remains constant. The prior probability of classes changes. For example, one brand may have a 2% anomaly rate while another has 5%.
  • Concept drift. P(Y|X) changes—the same input has different meanings in different domains. This is rare for same-structure cobots but can arise when different brands define “normal operating range” differently.

Key Unsupervised Domain Adaptation Methods

Discrepancy-Based Methods

These methods explicitly measure and minimise the distance between source and target feature distributions.

Maximum Mean Discrepancy (MMD) measures the distance between two distributions by comparing their mean embeddings in a reproducing kernel Hilbert space (RKHS). If the mean embeddings are identical, the distributions are identical (for characteristic kernels). In practice, an MMD penalty is added to the training loss to encourage the network to produce similar feature distributions for source and target data.
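A minimal sketch of an MMD penalty follows, assuming a single Gaussian (RBF) kernel with a fixed bandwidth and the simple biased estimator. The function names and the bandwidth `sigma=4.0` are illustrative choices; practical implementations often combine several bandwidths.

```python
import torch


def gaussian_kernel(a, b, sigma=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(dim=-1)
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_loss(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two feature batches."""
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st


torch.manual_seed(0)
src = torch.randn(64, 16)                 # stand-in for source features
tgt_same = torch.randn(64, 16)            # same distribution as the source
tgt_shifted = torch.randn(64, 16) + 2.0   # mean-shifted "other brand"

mmd_same = mmd_loss(src, tgt_same, sigma=4.0)
mmd_shifted = mmd_loss(src, tgt_shifted, sigma=4.0)

# During training the penalty would be added to the task loss, e.g.
#   loss = task_loss + mmd_weight * mmd_loss(source_feats, target_feats)
```

The shifted batch yields a larger value than the matched batch, which is the signal the network is trained to reduce. The weighting term `mmd_weight` in the comment is a hypothetical hyperparameter.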

CORAL (CORrelation ALignment) aligns the second-order statistics (covariance matrices) of source and target features. Deep CORAL integrates this alignment into the network by adding a CORAL loss at one or more hidden layers. The CORAL loss is the squared Frobenius norm of the difference between source and target covariance matrices, scaled by 1/(4d²) for d-dimensional features.
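The CORAL loss is short enough to write out directly. The sketch below assumes feature batches of shape (batch, d) and uses the squared Frobenius norm with the 1/(4d²) scaling from the Deep CORAL formulation; the function name and synthetic tensors are illustrative.

```python
import torch


def coral_loss(source, target):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2)."""
    d = source.size(1)
    cov_s = torch.cov(source.t())  # torch.cov expects variables in rows
    cov_t = torch.cov(target.t())
    return (cov_s - cov_t).pow(2).sum() / (4.0 * d * d)


torch.manual_seed(0)
src = torch.randn(256, 8)

loss_identical = coral_loss(src, src)
loss_scaled = coral_loss(src, 3.0 * src)     # different covariance
loss_mean_shift = coral_loss(src, src + 5.0)  # same covariance, shifted mean
```

A pure mean shift leaves the loss at (numerically) zero because CORAL only compares second-order statistics. Offsets between brands therefore need to be handled separately, for example by per-channel normalisation.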

Adversarial-Based Methods

These methods use an adversarial framework to learn domain-invariant features—features that are useful for the task but cannot be used by a discriminator to distinguish between source and target domains.

Domain-Adversarial Neural Networks (DANN) represent the principal approach. The architecture has three components: a shared feature extractor, a task classifier (for anomaly detection), and a domain discriminator. The key element is the gradient reversal layer (GRL): during backpropagation, gradients from the domain discriminator are reversed before reaching the feature extractor. The feature extractor is thus trained to maximise the domain discriminator’s loss—that is, to produce features that confuse the discriminator about which domain the data came from.

ADDA (Adversarial Discriminative Domain Adaptation) uses separate feature extractors for source and target, with the target extractor initialised from the source. The adversarial dynamic operates between the target encoder and the discriminator.

CyCADA (Cycle-Consistent Adversarial Domain Adaptation) combines pixel-level adaptation (using CycleGAN-style image translation) with feature-level adaptation. Although primarily used for visual tasks, the concept of cycle-consistent adaptation extends to other modalities.

[Figure: DANN architecture. Labelled source data (UR5e) and unlabelled target data (FANUC) pass through a shared feature extractor (1D-CNN or Transformer encoder). On top sit a task classifier (normal vs. anomalous, cross-entropy task loss) and, behind a gradient reversal layer, a binary domain classifier (source vs. target, domain loss). The GRL reverses the domain gradients during backpropagation, so the feature extractor minimises the task loss on labelled source data while maximising domain confusion, which yields domain-invariant features.]

Self-Training and Pseudo-Labelling

Self-training is conceptually simple but often effective: train on labelled source data, generate predictions (pseudo-labels) on unlabelled target data, and retrain on the combined dataset. The principal challenges are noise in the pseudo-labels and confirmation bias. Modern approaches use confidence thresholding (retaining only high-confidence pseudo-labels) and curriculum learning (beginning with the most confident predictions and gradually including less confident ones).
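The confidence-thresholding step can be sketched as below. The classifier here is an untrained stand-in for a source-trained model, and the helper name `select_pseudo_labels` and the threshold values are assumptions for illustration only.

```python
import torch
import torch.nn as nn


def select_pseudo_labels(model, target_x, threshold=0.9):
    """Keep target samples whose top softmax probability clears the threshold.

    Returns the retained inputs, their pseudo-labels, and the boolean mask.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(target_x), dim=1)
    confidence, pseudo_y = probs.max(dim=1)
    keep = confidence >= threshold
    return target_x[keep], pseudo_y[keep], keep


torch.manual_seed(0)
# Hypothetical stand-in for a classifier already trained on source data
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
target_x = torch.randn(200, 10)  # unlabelled target-domain batch

# Curriculum-style schedule: start strict, then relax the threshold
kept_per_round = []
for threshold in (0.9, 0.75, 0.6):
    x_keep, y_keep, mask = select_pseudo_labels(model, target_x, threshold)
    kept_per_round.append(int(mask.sum()))
    # In a full loop, (x_keep, y_keep) would be merged with the labelled
    # source data and the model retrained before the next round.
```

Lowering the threshold can only admit more samples, so the retained count is non-decreasing across rounds. The trade-off is that the later additions carry noisier labels.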

Optimal Transport Methods

Optimal transport provides a mathematically principled means of measuring and minimising the distance between distributions using the Wasserstein distance. It identifies the minimum cost of transforming one distribution into another and can be used to explicitly map source features to target features.
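For intuition, the one-dimensional case has a closed form: with two equal-size samples, the Wasserstein-1 distance is the mean absolute difference between the sorted values. The sketch below shows that case and, as a plainly named substitute for full optimal transport, a sliced Wasserstein distance that averages the 1D result over random projections. Function names, the number of projections, and the equal-batch-size requirement are assumptions of this sketch.

```python
import torch


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1D samples."""
    assert a.numel() == b.numel(), "this sketch assumes equal sample sizes"
    a_sorted, _ = torch.sort(a.flatten())
    b_sorted, _ = torch.sort(b.flatten())
    return (a_sorted - b_sorted).abs().mean()


def sliced_wasserstein(source, target, n_projections=64, seed=0):
    """Average 1D Wasserstein-1 distance over random unit directions."""
    assert source.shape == target.shape, "this sketch assumes equal batch shapes"
    g = torch.Generator().manual_seed(seed)
    directions = torch.randn(source.size(1), n_projections, generator=g)
    directions = directions / directions.norm(dim=0, keepdim=True)
    s_sorted, _ = torch.sort(source @ directions, dim=0)
    t_sorted, _ = torch.sort(target @ directions, dim=0)
    return (s_sorted - t_sorted).abs().mean()


torch.manual_seed(0)
a = torch.randn(500)
shift_distance = wasserstein_1d(a, a + 3.0)  # a pure shift of 3 costs 3

src = torch.randn(128, 16)
swd_same = sliced_wasserstein(src, src)
swd_shifted = sliced_wasserstein(src, src + 2.0)
```

Sorting is differentiable with respect to the values in PyTorch, so the sliced distance can serve as a training penalty on feature batches. Exact multi-dimensional optimal transport requires a dedicated solver and is considerably more expensive.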

Advanced Domain Adaptation Scenarios

The standard UDA setup assumes one source and one target domain. Real-world scenarios are often more complex:

  • Multi-source domain adaptation. Labelled data is available from multiple source domains (for example, three cobot brands), and the objective is to adapt to a new target brand. Methods such as MDAN (Multi-source Domain Adversarial Networks) and M3SDA handle this by learning domain-specific and domain-shared features simultaneously.
  • Partial domain adaptation. The target domain contains fewer classes than the source. For example, the source model detects 10 types of anomalies, but the target brand exhibits only six of them. Standard UDA methods can perform poorly because they attempt to align classes that do not exist in the target.
  • Open-set domain adaptation. The target domain contains classes not seen in the source. This is realistic for cobots: a new brand may exhibit failure modes absent from the training data. Methods must both adapt known classes and detect unknown target-specific anomalies.

Method Comparison

| Method | Mechanism | Best When | Complexity | Performance |
|---|---|---|---|---|
| MMD | Match kernel mean embeddings | Small domain gap, clean data | Low | Good baseline |
| CORAL | Align covariance matrices | Linear shifts between domains | Low | Good for simple shifts |
| DANN | Adversarial domain confusion | Complex nonlinear shifts | Medium | Strong across scenarios |
| Self-Training | Pseudo-label target data | High-confidence predictions available | Low | Variable (depends on pseudo-label quality) |
| Optimal Transport | Wasserstein distance minimization | Strong theoretical guarantees needed | High | Strong but computationally expensive |

DANN Implementation with Gradient Reversal Layer

The following is a complete PyTorch implementation of a Domain-Adversarial Neural Network:

import torch
import torch.nn as nn
from torch.autograd import Function


class GradientReversalFunction(Function):
    """Gradient Reversal Layer (GRL).

    Forward pass: identity function.
    Backward pass: negate gradients and scale by lambda.
    """
    @staticmethod
    def forward(ctx, x, lambda_val):
        ctx.lambda_val = lambda_val
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_val * grad_output, None


class GradientReversalLayer(nn.Module):
    def __init__(self, lambda_val=1.0):
        super().__init__()
        self.lambda_val = lambda_val

    def forward(self, x):
        return GradientReversalFunction.apply(x, self.lambda_val)


class DANN(nn.Module):
    """Domain-Adversarial Neural Network for time-series data."""

    def __init__(self, n_input_channels=24, n_classes=2, n_domains=2):
        super().__init__()

        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # Global average pooling
        )

        # Task classifier (anomaly detection)
        self.task_classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

        # Domain discriminator
        self.domain_discriminator = nn.Sequential(
            GradientReversalLayer(lambda_val=1.0),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_domains),
        )

    def forward(self, x):
        features = self.feature_extractor(x).squeeze(-1)
        task_output = self.task_classifier(features)
        domain_output = self.domain_discriminator(features)
        return task_output, domain_output

    def set_lambda(self, lambda_val):
        """Update GRL lambda (schedule during training)."""
        for module in self.domain_discriminator.modules():
            if isinstance(module, GradientReversalLayer):
                module.lambda_val = lambda_val


def train_dann(model, source_loader, target_loader, n_epochs=50, device='cpu'):
    """Train DANN with progressive lambda scheduling."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    task_criterion = nn.CrossEntropyLoss()
    domain_criterion = nn.CrossEntropyLoss()

    model.to(device)

    for epoch in range(n_epochs):
        model.train()

        # Progressive lambda: 0 -> 1 over training
        p = epoch / n_epochs
        lambda_val = 2.0 / (1.0 + torch.exp(torch.tensor(-10.0 * p))) - 1.0
        model.set_lambda(lambda_val.item())

        # Iterate over both loaders simultaneously
        target_iter = iter(target_loader)

        for source_x, source_y in source_loader:
            try:
                target_x, _ = next(target_iter)
            except StopIteration:
                target_iter = iter(target_loader)
                target_x, _ = next(target_iter)

            source_x = source_x.to(device)
            source_y = source_y.to(device)
            target_x = target_x.to(device)

            # Source domain: label = 0
            source_task_out, source_domain_out = model(source_x)
            source_domain_labels = torch.zeros(
                source_x.size(0), dtype=torch.long, device=device
            )

            # Target domain: label = 1 (no task labels!)
            _, target_domain_out = model(target_x)
            target_domain_labels = torch.ones(
                target_x.size(0), dtype=torch.long, device=device
            )

            # Combined loss
            task_loss = task_criterion(source_task_out, source_y)
            domain_loss = domain_criterion(source_domain_out, source_domain_labels) \
                        + domain_criterion(target_domain_out, target_domain_labels)

            total_loss = task_loss + domain_loss

            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{n_epochs} | "
                  f"Task Loss: {task_loss.item():.4f} | "
                  f"Domain Loss: {domain_loss.item():.4f} | "
                  f"Lambda: {lambda_val.item():.4f}")

Key Takeaway: The gradient reversal layer is central to DANN. It causes the feature extractor to learn representations that simultaneously minimise the task classification loss and maximise the domain classification loss. The result is a set of features that are useful for anomaly detection while remaining brand-agnostic.

The Cobot Anomaly Detection Scenario

Consider applying the foregoing material to a concrete, industrially relevant problem. A factory operates multiple collaborative robots from different manufacturers: Universal Robots UR5e, FANUC CRX-10iA, ABB GoFa, KUKA LBR iiwa, and Doosan M1013. All are six- or seven-axis articulated arms performing similar tasks, and all generate sensor data: joint torques, positions, velocities, and motor currents.

The objective is one anomaly detection system that works across all brands, or, at minimum, a system that can be quickly adapted to a new brand without collecting thousands of labelled anomaly examples.

The challenge is that, despite a shared kinematic structure, each brand has fundamentally different data distributions, owing to:

  • Sensor characteristics. Different torque sensor resolutions, noise floors, and sampling rates (anywhere from 125 Hz to 1 kHz).
  • Control systems. Different PID gains, trajectory planning algorithms, and jerk limits.
  • Calibration. Different zero-point offsets, gear ratio tolerances, and friction models.
  • Firmware. Different interpolation methods, filtering strategies, and data encoding.

Six strategies are now examined, ranging from simple preprocessing to sophisticated neural domain adaptation.

Strategy 1: Domain-Invariant Feature Learning with DANN

This is the most principled approach. Using the DANN architecture from the previous section, the practitioner trains on labelled data from one brand (for example, the UR5e, the most common cobot with the most available data) and uses unlabelled data from other brands during training. The gradient reversal layer requires the feature extractor to learn representations that capture anomaly-relevant patterns while remaining invariant to brand-specific sensor characteristics.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np


class CobotSensorDataset(Dataset):
    """Dataset for multi-joint cobot sensor data.

    Each sample: (n_joints * n_features, seq_len) tensor
    Features per joint: torque, position, velocity, current
    """
    def __init__(self, data, labels, domain_id):
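        # torch.as_tensor accepts NumPy arrays of any dtype and converts explicitly
        self.data = torch.as_tensor(data, dtype=torch.float32)   # (N, channels, seq_len)
        self.labels = torch.as_tensor(labels, dtype=torch.long)  # (N,) - 0=normal, 1=anomaly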
        self.domain_id = domain_id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx], self.domain_id


class CobotDANN(nn.Module):
    """DANN specifically designed for cobot anomaly detection.

    Input: multi-joint sensor data (6 joints x 4 features = 24 channels)
    Task: binary anomaly detection
    Domain: cobot brand identification (adversarial)
    """
    def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
        super().__init__()
        in_ch = n_joints * features_per_joint

        self.encoder = nn.Sequential(
            # Block 1: capture local temporal patterns
            nn.Conv1d(in_ch, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),

            # Block 2: capture mid-range dependencies
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),

            # Block 3: high-level features
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

        self.anomaly_head = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )

        self.domain_head = nn.Sequential(
            GradientReversalLayer(lambda_val=1.0),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_brands),
        )

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        anomaly_pred = self.anomaly_head(features)
        domain_pred = self.domain_head(features)
        return anomaly_pred, domain_pred, features

    def predict_anomaly(self, x):
        """Inference: only anomaly prediction needed."""
        features = self.encoder(x).squeeze(-1)
        return self.anomaly_head(features)
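
The class above exposes both heads, but the loss for the multi-brand case is not shown. The following is a minimal sketch of that loss, assuming batches mix brands and only one brand carries trusted anomaly labels. The function name `dann_losses` and the `source_domain` argument are illustrative and not part of the original code; the random logits stand in for the model's two outputs.

```python
import torch
import torch.nn.functional as F


def dann_losses(anomaly_logits, domain_logits, labels, domain_ids,
                source_domain=0):
    """Task loss on labelled-brand samples only; domain loss on all samples.

    anomaly_logits: (batch, 2), domain_logits: (batch, n_brands)
    labels: (batch,) anomaly labels, ignored for non-source brands
    domain_ids: (batch,) integer brand index per sample
    """
    # Brand classification uses every sample; the gradient reversal layer
    # inside the model turns this into an adversarial signal for the encoder.
    domain_loss = F.cross_entropy(domain_logits, domain_ids)

    # Anomaly classification uses only samples from the labelled brand.
    labelled = domain_ids == source_domain
    if labelled.any():
        task_loss = F.cross_entropy(anomaly_logits[labelled], labels[labelled])
    else:
        # No labelled samples in this batch: contribute zero task loss.
        task_loss = anomaly_logits.sum() * 0.0
    return task_loss, domain_loss


# Toy batch: 8 samples, 5 brands, random logits standing in for model output.
torch.manual_seed(0)
anomaly_logits = torch.randn(8, 2, requires_grad=True)
domain_logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 2, (8,))
domain_ids = torch.tensor([0, 0, 1, 2, 3, 4, 0, 1])

task_loss, domain_loss = dann_losses(anomaly_logits, domain_logits,
                                     labels, domain_ids)
(task_loss + domain_loss).backward()
```

With the default `DataLoader` collation, the integer `domain_id` returned by `CobotSensorDataset` arrives as a `(batch,)` tensor, which is the shape `domain_ids` is assumed to have here.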

Strategy 2: Multi-Source Domain Adaptation

When data from multiple brands is available, all sources can be used simultaneously. The key idea is to use domain-specific batch normalisation: each brand receives its own BN layer to handle its distinctive distribution statistics, while all other weights remain shared. This captures the intuition that different brands have different means and variances in their sensor data, but the learned features (convolution filters) should be universal.

class DomainSpecificBatchNorm(nn.Module):
    """Maintain separate BN statistics per domain (brand)."""

    def __init__(self, n_features, n_domains):
        super().__init__()
        self.bn_layers = nn.ModuleList([
            nn.BatchNorm1d(n_features) for _ in range(n_domains)
        ])
        self.n_domains = n_domains

    def forward(self, x, domain_id):
        # Training mode normalises with batch statistics; eval mode uses the
        # selected brand's running statistics. BatchNorm1d handles both itself.
        return self.bn_layers[domain_id](x)

    def add_domain(self):
        """Add BN layer for a new brand — initialize from average of existing."""
        # Create the new layer on the same device as the existing ones
        ref = self.bn_layers[0]
        new_bn = nn.BatchNorm1d(ref.num_features).to(ref.weight.device)

        # Initialize with average statistics across existing domains
        with torch.no_grad():
            avg_mean = torch.stack(
                [bn.running_mean for bn in self.bn_layers]
            ).mean(0)
            avg_var = torch.stack(
                [bn.running_var for bn in self.bn_layers]
            ).mean(0)
            new_bn.running_mean.copy_(avg_mean)
            new_bn.running_var.copy_(avg_var)

        self.bn_layers.append(new_bn)
        self.n_domains += 1


class MultiSourceCobotModel(nn.Module):
    """Multi-source model with domain-specific batch normalization."""

    def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
        super().__init__()
        in_ch = n_joints * features_per_joint

        self.conv1 = nn.Conv1d(in_ch, 64, kernel_size=7, padding=3)
        self.bn1 = DomainSpecificBatchNorm(64, n_brands)

        self.conv2 = nn.Conv1d(64, 128, kernel_size=5, padding=2)
        self.bn2 = DomainSpecificBatchNorm(128, n_brands)

        self.conv3 = nn.Conv1d(128, 256, kernel_size=3, padding=1)
        self.bn3 = DomainSpecificBatchNorm(256, n_brands)

        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )

    def forward(self, x, domain_id=0):
        x = torch.relu(self.bn1(self.conv1(x), domain_id))
        x = torch.relu(self.bn2(self.conv2(x), domain_id))
        x = torch.relu(self.bn3(self.conv3(x), domain_id))
        x = self.pool(x).squeeze(-1)
        return self.classifier(x)
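Tip: When a new brand is introduced, call add_domain() on every domain-specific BN layer (model.bn1, model.bn2, and model.bn3). Then, with the model in training mode and gradients disabled, pass a few hundred unlabelled samples from the new brand through it using the new domain_id, so that the new BN layers accumulate that brand's running statistics. No labelled data is required for initial deployment.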
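
A calibration pass of this kind could look like the sketch below. It assumes a model whose forward takes `(x, domain_id)`, as `MultiSourceCobotModel` does; to stay self-contained it uses a small stand-in module (`TinyDomainModel`, hypothetical) rather than the full class. The helper name `calibrate_bn` and its `bn_layers` argument are likewise illustrative. Setting `momentum = None` makes PyTorch accumulate a cumulative average, which is one reasonable choice for a short pass over a few hundred samples.

```python
import torch
import torch.nn as nn


class TinyDomainModel(nn.Module):
    """Stand-in for a model with one BatchNorm1d per brand (hypothetical)."""

    def __init__(self, n_channels=4, n_domains=2):
        super().__init__()
        self.bn_layers = nn.ModuleList(
            [nn.BatchNorm1d(n_channels) for _ in range(n_domains)]
        )

    def forward(self, x, domain_id):
        return self.bn_layers[domain_id](x)


@torch.no_grad()
def calibrate_bn(model, batches, domain_id, bn_layers):
    """Re-estimate running statistics of `bn_layers` from unlabelled batches."""
    was_training = model.training
    model.train()  # BN only updates running statistics in training mode
    old_momentum = [bn.momentum for bn in bn_layers]
    for bn in bn_layers:
        bn.reset_running_stats()
        bn.momentum = None  # None selects a cumulative moving average
    for x in batches:
        model(x, domain_id)  # outputs are discarded; only statistics matter
    for bn, momentum in zip(bn_layers, old_momentum):
        bn.momentum = momentum
    model.train(was_training)


torch.manual_seed(0)
model = TinyDomainModel()
# Unlabelled data from the new brand: offset 3.0, scale 2.0 on every channel.
batches = [3.0 + 2.0 * torch.randn(64, 4, 50) for _ in range(20)]
calibrate_bn(model, batches, domain_id=1, bn_layers=[model.bn_layers[1]])
print(model.bn_layers[1].running_mean, model.bn_layers[1].running_var)
```

For `MultiSourceCobotModel`, the `bn_layers` argument would be the new brand's entries in `model.bn1`, `model.bn2`, and `model.bn3`. The statistics of the other brands are left untouched because their layers are never called during the pass.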

Strategy 3: Fine-Tuning with Normalisation Alignment

This is the pragmatic approach. Pre-train a full anomaly detection model on the best-labelled brand (for example, the UR5e with 50,000 labelled samples). When adapting to a new brand, freeze all convolutional and LSTM weights and fine-tune only the batch normalisation layers and the final classifier head.

The reason this approach is effective is that the kinematic structure is the same across brands. The convolutional filters that detect “sudden torque spike in joint 3” or “velocity reversal pattern” are essentially the same regardless of brand. What differs is the statistical distribution of the data, which is precisely what batch normalisation captures.

def bn_only_fine_tune(pretrained_model, target_loader, n_epochs=10, lr=1e-3):
    """Fine-tune only BatchNorm layers + classifier for a new cobot brand.

    This is the fastest adaptation strategy: typically converges in
    5-10 epochs with as few as 100-500 labeled samples.
    """
    model = pretrained_model

    # Freeze everything
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze only BatchNorm parameters and classifier
    for module in model.modules():
        if isinstance(module, nn.BatchNorm1d):
            for param in module.parameters():
                param.requires_grad = True
            # Reset running statistics for the new domain
            module.reset_running_stats()

    for param in model.classifier.parameters():
        param.requires_grad = True

    # Collect trainable params
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()

    print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
    print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch_x, batch_y in target_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            predicted = output.argmax(dim=1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)

        acc = 100.0 * correct / total
        avg_loss = total_loss / len(target_loader)
        print(f"Epoch {epoch+1}/{n_epochs} | Loss: {avg_loss:.4f} | Acc: {acc:.1f}%")

    return model

Strategy 4: Contrastive Domain Adaptation

Contrastive learning offers a strong alternative to adversarial approaches. The core idea is to learn an embedding space in which “normal” operation from any brand maps to similar representations, while “anomalous” patterns remain distinguishable regardless of the brand that produced them.

A Supervised Contrastive (SupCon) loss is used. It pulls together embeddings of the same class (normal or anomaly) regardless of brand, while pushing apart embeddings of different classes:

class SupConDomainLoss(nn.Module):
    """Supervised contrastive loss that ignores domain (brand) labels.

    Positive pairs: same anomaly class, any brand
    Negative pairs: different anomaly class, any brand

    This forces brand-invariant but anomaly-discriminative embeddings.
    """
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, features, labels):
        """
        Args:
            features: (batch_size, feature_dim) - L2-normalized embeddings
            labels: (batch_size,) - anomaly labels (0=normal, 1=anomaly)
        """
        device = features.device
        batch_size = features.shape[0]

        # Pairwise similarity matrix
        similarity = torch.matmul(features, features.T) / self.temperature

        # Mask: 1 where labels match (positive pairs), 0 otherwise
        labels = labels.unsqueeze(1)
        mask = torch.eq(labels, labels.T).float().to(device)

        # Remove self-similarity from mask
        self_mask = torch.eye(batch_size, device=device)
        mask = mask - self_mask

        # Numerical stability
        logits_max = similarity.max(dim=1, keepdim=True).values.detach()
        logits = similarity - logits_max

        # Denominator: all pairs except self
        exp_logits = torch.exp(logits) * (1 - self_mask)
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)

        # Average over positive pairs
        n_positives = mask.sum(dim=1)
        mean_log_prob = (mask * log_prob).sum(dim=1) / (n_positives + 1e-8)

        loss = -mean_log_prob[n_positives > 0].mean()
        return loss


class ContrastiveCobotModel(nn.Module):
    """Contrastive model for cross-brand cobot anomaly detection."""

    def __init__(self, n_input_channels=24, embed_dim=128):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

        # Projection head for contrastive learning
        self.projector = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

        # Classifier for anomaly detection
        self.classifier = nn.Linear(256, 2)

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        projections = nn.functional.normalize(self.projector(features), dim=1)
        logits = self.classifier(features)
        return logits, projections
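
Written out, the quantity that `SupConDomainLoss.forward` computes is the following. The symbols are introduced here for illustration and do not appear in the original code: $z_i$ is the L2-normalised projection of sample $i$, $y_i$ its anomaly label, $\tau$ the temperature, $P(i)$ the set of positives for anchor $i$, and $A$ the set of anchors that have at least one positive in the batch.

```latex
\mathcal{L}_{\text{SupCon}}
  = -\frac{1}{|A|} \sum_{i \in A} \frac{1}{|P(i)|} \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}
              {\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)},
\qquad
P(i) = \{\, p \neq i : y_p = y_i \,\}
```

Brand labels never enter the expression, which is what pushes the embedding toward brand invariance. A sample that is the only member of its class in a batch has no positives and drops out of $A$, so batches should contain at least two samples per class; with a 99:1 class imbalance this usually means oversampling anomalies when building batches.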

Strategy 5: Feature Normalisation and Preprocessing

Before turning to neural domain adaptation, consider whether simple preprocessing can eliminate the distribution gap. This straightforward approach is often underused and is sometimes sufficient on its own:

import numpy as np
from scipy.interpolate import interp1d


class CobotSignalNormalizer:
    """Normalize sensor signals to a common reference frame across brands.

    This preprocessing pipeline handles:
    1. Sampling rate alignment (resample each window to target_seq_len samples)
    2. Per-joint Z-score normalization (per brand statistics)
    3. Signal clipping for outlier robustness

    Torque residual computation (removing gravity/friction effects) needs the
    robot's dynamic model and is not implemented here.
    """

    def __init__(self, target_sample_rate=250, target_seq_len=200):
        self.target_sample_rate = target_sample_rate
        self.target_seq_len = target_seq_len
        self.brand_stats = {}  # {brand: {joint: {feature: (mean, std)}}}

    def fit_brand(self, brand_name, data):
        """Compute normalization statistics for a brand.

        Args:
            brand_name: str, e.g. 'ur5e'
            data: np.array of shape (n_samples, n_joints, n_features, seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape
        stats = {}
        for j in range(n_joints):
            stats[j] = {}
            for f in range(n_features):
                channel_data = data[:, j, f, :].flatten()
                stats[j][f] = (
                    float(np.mean(channel_data)),
                    float(np.std(channel_data)) + 1e-8
                )
        self.brand_stats[brand_name] = stats

    def normalize(self, data, brand_name, source_sample_rate):
        """Normalize a batch of sensor data from a specific brand.

        Args:
            data: np.array (n_samples, n_joints, n_features, seq_len)
            brand_name: str
            source_sample_rate: int, Hz

        Returns:
            Normalized data: np.array (n_samples, n_joints*n_features, target_seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape

        # Step 1: Resample to the common window length.
        # Assumes every window spans the same duration, so mapping seq_len
        # samples onto target_seq_len samples is equivalent to a rate change.
        if (source_sample_rate != self.target_sample_rate
                or seq_len != self.target_seq_len):
            source_times = np.linspace(0, 1, seq_len)
            target_times = np.linspace(0, 1, self.target_seq_len)
            # interp1d interpolates along the last axis for all channels at once
            interpolator = interp1d(source_times, data, kind='cubic', axis=-1)
            data = interpolator(target_times)

        # Step 2: Z-score normalization per joint per feature
        stats = self.brand_stats[brand_name]
        normalized = np.zeros_like(data)
        for j in range(n_joints):
            for f in range(n_features):
                mean, std = stats[j][f]
                normalized[:, j, f, :] = (data[:, j, f, :] - mean) / std

        # Step 3: Clip to ±5 sigma for robustness
        normalized = np.clip(normalized, -5, 5)

        # Step 4: Reshape to (n_samples, channels, seq_len)
        n_samples = normalized.shape[0]
        seq_len = normalized.shape[-1]
        output = normalized.reshape(n_samples, n_joints * n_features, seq_len)

        return output

Strategy 6: Foundation Model Approach

The most forward-looking approach draws on the emerging ecosystem of time-series foundation models. The pattern is to pre-train a large model on data from all available cobot brands in a self-supervised manner (for example, masked time-series modelling) and then fine-tune for anomaly detection with minimal labelled data from each brand.

This approach is most appropriate when substantial unlabelled sensor data is available across many brands, which is increasingly common as cobot fleets grow. Models such as Chronos (Amazon), TimesFM (Google), and Lag-Llama have shown that transformer-based architectures can learn transferable representations across diverse time-series domains.

class CobotFoundationModel(nn.Module):
    """Simplified foundation model for cobot sensor time-series.

    Pre-training task: masked sensor reconstruction
    Fine-tuning task: anomaly detection
    """
    def __init__(self, n_channels=24, d_model=256, n_heads=8,
                 n_layers=6, seq_len=200, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio

        # Per-timestep embedding: each timestep's channel vector is one token
        # (no patching across time)
        self.input_proj = nn.Linear(n_channels, d_model)
        self.pos_embedding = nn.Parameter(
            torch.randn(1, seq_len, d_model) * 0.02
        )

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_layers
        )

        # Pre-training head: reconstruct masked timesteps
        self.reconstruction_head = nn.Linear(d_model, n_channels)

        # Fine-tuning head: anomaly classification
        self.anomaly_head = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2),
        )

    def forward_pretrain(self, x):
        """Pre-training: masked reconstruction.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)  # (batch, seq_len, n_channels)
        batch_size, seq_len, _ = x.shape

        # Create random mask
        mask = torch.rand(batch_size, seq_len, device=x.device) < self.mask_ratio
        masked_x = x.clone()
        masked_x[mask] = 0.0

        # Encode
        h = self.input_proj(masked_x) + self.pos_embedding[:, :seq_len, :]
        h = self.transformer(h)

        # Reconstruct
        reconstruction = self.reconstruction_head(h)

        # Loss only on masked positions
        loss = nn.functional.mse_loss(
            reconstruction[mask], x[mask]
        )
        return loss

    def forward_anomaly(self, x):
        """Fine-tuning / inference: anomaly detection.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)
        h = self.input_proj(x) + self.pos_embedding[:, :x.size(1), :]
        h = self.transformer(h)

        # Global average pooling across time
        h_pooled = h.mean(dim=1)
        return self.anomaly_head(h_pooled)

Strategy Comparison and Recommendation

| Strategy | Labeled Data Needed | Complexity | Adaptation Speed | Expected Performance |
|---|---|---|---|---|
| 1. DANN | Source only | Medium-High | Slow (retrain) | High |
| 2. Multi-Source BN | Multiple sources | Medium | Fast (BN calibration only) | High |
| 3. BN Fine-Tuning | 100-500 target samples | Low | Very fast (minutes) | Good |
| 4. Contrastive | Source + some target | Medium-High | Moderate | High |
| 5. Normalization | None (unsupervised stats) | Very Low | Instant | Moderate |
| 6. Foundation Model | Minimal per brand | Very High | Fast (once pre-trained) | Highest (with scale) |

Key Takeaway and Recommended Pipeline: Begin with Strategy 5 (normalisation) combined with Strategy 3 (BN fine-tuning) as the baseline. This combination is fast to implement, requires minimal labelled data, and handles the most common sources of cross-brand distribution shift. If performance is insufficient, escalate to Strategy 1 (DANN) or Strategy 2 (Multi-Source BN). Reserve Strategy 6 (Foundation Model) for organisations with large-scale multi-brand data and the compute budget to match.

Practical Implementation Guide

Data Collection for Cobots

The quality of domain adaptation depends entirely on the quality of the data. For multi-brand cobot anomaly detection, the following considerations apply:

Sensor selection. At a minimum, collect per-joint torque, position, velocity, and motor current. These four signals per joint provide a comprehensive view of the robot's mechanical state. For a six-axis cobot, this yields 24 sensor channels.

Sampling rate. Different brands sample at different rates (UR5e at 500 Hz, FANUC at 250 Hz, KUKA at 1 kHz). Either resample to a common rate, or use architectures that accept variable-length inputs.

Labelling strategy. Labelling anomalies requires domain expertise. A practical approach is to label by operational segment (one pick-and-place cycle) rather than by individual timestep. Use a three-tier scheme—normal, anomalous, and uncertain—and train only on the first two.

Data volume guidelines. For the source brand, aim for at least 10,000 labelled segments (with at least 500 anomalies). For target brands, even 100 to 500 labelled segments enable effective fine-tuning under Strategy 3; the normalisation statistics of Strategy 5 need no labels at all.
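
For the sampling-rate point above, resampling by elapsed time rather than by sample count keeps windows from different brands comparable. The following is a minimal sketch using linear interpolation with NumPy. The function name `resample_to_rate` and the 250 Hz target are assumptions for illustration, and downsampling a noisy signal would normally be preceded by a low-pass (anti-aliasing) filter, which is omitted here.

```python
import numpy as np


def resample_to_rate(signal, source_rate_hz, target_rate_hz):
    """Linearly resample a (channels, n_samples) window to a new rate.

    The window duration is preserved; only the number of samples changes.
    """
    signal = np.asarray(signal, dtype=float)
    n_src = signal.shape[-1]
    duration_s = n_src / source_rate_hz
    n_tgt = int(round(duration_s * target_rate_hz))
    t_src = np.arange(n_src) / source_rate_hz
    t_tgt = np.arange(n_tgt) / target_rate_hz
    # np.interp works on 1-D arrays, so resample channel by channel.
    return np.stack([np.interp(t_tgt, t_src, ch) for ch in signal])


# One second of 24-channel data at 1 kHz and at 500 Hz, both mapped to 250 Hz.
fast = np.random.default_rng(0).normal(size=(24, 1000))
slow = np.random.default_rng(1).normal(size=(24, 500))
fast_250 = resample_to_rate(fast, 1000, 250)
slow_250 = resample_to_rate(slow, 500, 250)
print(fast_250.shape, slow_250.shape)
```

After this step, windows of equal duration have equal length regardless of the brand that produced them, so they can be batched together.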

Feature Engineering for Multi-Joint Cobots

Raw sensor signals can be augmented with engineered features that capture domain-relevant physics:

  • Joint torque residuals. The difference between measured torque and the torque expected from the robot's dynamic model. This removes the "normal" torque component (gravity, inertia, friction) and isolates anomalous forces.
  • Energy consumption profiles. Power = torque × velocity per joint. Anomalies often manifest as unexpected energy consumption patterns before they appear in raw signals.
  • Vibration spectra. FFT of accelerometer or high-frequency torque data. Bearing degradation, gear wear, and loose bolts each have distinctive frequency signatures.
  • Kinematic error metrics. The difference between commanded and actual trajectory. Increasing tracking error often precedes mechanical failure.
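
Three of the features listed above can be sketched with NumPy alone. Torque residuals are left out because they need the manufacturer's dynamic model. The function names and the synthetic 60 Hz vibration are assumptions chosen for illustration, not values from a real cobot.

```python
import numpy as np


def joint_power(torque, velocity):
    """Instantaneous mechanical power per joint: torque * angular velocity."""
    return np.asarray(torque) * np.asarray(velocity)


def tracking_error(commanded, actual):
    """RMS difference between commanded and measured position, per joint."""
    diff = np.asarray(commanded) - np.asarray(actual)
    return np.sqrt(np.mean(diff ** 2, axis=-1))


def dominant_frequency(signal, sample_rate_hz):
    """Frequency (Hz) of the largest non-DC peak in the magnitude spectrum."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate_hz)
    return float(freqs[np.argmax(spectrum)])


# Synthetic example: one joint, 1 s at 500 Hz, with a 60 Hz vibration component.
t = np.arange(500) / 500.0
torque = 2.0 + 0.3 * np.sin(2 * np.pi * 60 * t)   # N*m
velocity = np.full_like(t, 0.5)                    # rad/s
power = joint_power(torque, velocity)              # W
peak_hz = dominant_frequency(torque, 500)
err = tracking_error(np.sin(t)[None, :], np.sin(t)[None, :] + 0.01)
```

The resulting arrays can be stacked onto the raw channels as extra inputs, or summarised per segment and fed to a simpler model.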

Model Architecture Choices

| Architecture | Strengths | Weaknesses | Best For |
|---|---|---|---|
| 1D-CNN | Fast, local pattern detection | Limited long-range dependencies | Short anomaly patterns, real-time edge |
| LSTM/GRU | Sequential memory, temporal context | Slow training, vanishing gradients | Long-term degradation patterns |
| LSTM-AutoEncoder | Unsupervised, reconstruction-based | Threshold tuning, slower inference | Minimal labels, novelty detection |
| Transformer | Global attention, parallelizable | Data-hungry, quadratic complexity | Large datasets, complex multi-joint patterns |
| CNN-LSTM Hybrid | Best of both: local + temporal | More hyperparameters | General-purpose (recommended) |

For the cobot scenario, the CNN-LSTM hybrid is typically the best starting point. A complete implementation with domain adaptation support follows:

class CobotCNNLSTMAutoEncoder(nn.Module):
    """CNN-LSTM AutoEncoder with domain adaptation for cobot anomaly detection.

    Architecture:
    - CNN encoder: extracts local temporal features
    - LSTM: captures sequential dependencies
    - CNN decoder: reconstructs input signal
    - Domain discriminator (optional): for DANN-style adaptation

    Anomaly score: reconstruction error (MSE)
    """
    def __init__(self, n_channels=24, hidden_dim=128, lstm_layers=2,
                 n_domains=None):
        super().__init__()

        # --- Encoder ---
        self.conv_encoder = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )

        self.lstm_encoder = nn.LSTM(
            input_size=128,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.2,
        )

        # Bottleneck
        self.bottleneck = nn.Linear(hidden_dim * 2, hidden_dim)

        # --- Decoder ---
        self.lstm_decoder = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            dropout=0.2,
        )

        self.conv_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden_dim, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv1d(128, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, n_channels, kernel_size=3, padding=1),
        )

        # Optional domain discriminator
        self.domain_discriminator = None
        if n_domains is not None:
            self.domain_discriminator = nn.Sequential(
                GradientReversalLayer(lambda_val=1.0),
                nn.Linear(hidden_dim, 64),
                nn.ReLU(),
                nn.Linear(64, n_domains),
            )

    def encode(self, x):
        """Encode input to latent representation.

        x: (batch, n_channels, seq_len)
        """
        # CNN encoding
        conv_out = self.conv_encoder(x)  # (batch, 128, seq_len//4)

        # LSTM encoding
        conv_out = conv_out.transpose(1, 2)  # (batch, seq_len//4, 128)
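        _, (h_n, _) = self.lstm_encoder(conv_out)  # h_n: (layers * 2, batch, hidden_dim)

        # Global representation: final hidden state of each direction (top layer).
        # lstm_out[:, -1, :] would hand the backward direction a single timestep.
        global_repr = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, 2 * hidden_dim)
        latent = self.bottleneck(global_repr)  # (batch, hidden_dim)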

        return latent, conv_out.shape[1]  # return seq_len for decoder

    def decode(self, latent, target_seq_len):
        """Decode latent representation back to signal.

        latent: (batch, hidden_dim)
        """
        # Repeat latent for each timestep
        repeated = latent.unsqueeze(1).repeat(1, target_seq_len, 1)

        # LSTM decoding
        lstm_out, _ = self.lstm_decoder(repeated)  # (batch, seq_len, hidden_dim)

        # CNN decoding
        lstm_out = lstm_out.transpose(1, 2)  # (batch, hidden_dim, seq_len)
        reconstruction = self.conv_decoder(lstm_out)

        return reconstruction

    def forward(self, x):
        latent, seq_len = self.encode(x)
        reconstruction = self.decode(latent, seq_len)

        # Ensure reconstruction matches input size
        if reconstruction.size(2) != x.size(2):
            reconstruction = nn.functional.interpolate(
                reconstruction, size=x.size(2), mode='linear',
                align_corners=False
            )

        domain_pred = None
        if self.domain_discriminator is not None:
            domain_pred = self.domain_discriminator(latent)

        return reconstruction, domain_pred, latent

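    @torch.no_grad()
    def anomaly_score(self, x):
        """Compute per-sample anomaly score (reconstruction error).

        Intended for inference: call model.eval() first so that BatchNorm
        and dropout do not perturb the score.
        """
        reconstruction, _, _ = self.forward(x)
        # MSE per sample, averaged over channels and time
        mse = ((x - reconstruction) ** 2).mean(dim=(1, 2))
        return mse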


def train_cobot_autoencoder(model, source_loader, target_loader=None,
                            n_epochs=100, device='cpu'):
    """Train the CNN-LSTM AutoEncoder with optional domain adaptation."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, n_epochs)

    model.to(device)

    for epoch in range(n_epochs):
        model.train()
        total_recon_loss = 0
        total_domain_loss = 0

        # Compare against None: truth-testing a DataLoader calls len() on it
        target_iter = iter(target_loader) if target_loader is not None else None

        for batch_x, _, _ in source_loader:
            batch_x = batch_x.to(device)

            reconstruction, domain_pred, _ = model(batch_x)

            # No resizing needed here: model.forward already interpolates the
            # reconstruction to the input length.

            recon_loss = nn.functional.mse_loss(reconstruction, batch_x)
            total_loss = recon_loss

            # Domain adaptation loss (if target data available)
            if target_iter is not None and domain_pred is not None:
                try:
                    target_x, _, _ = next(target_iter)
                except StopIteration:
                    target_iter = iter(target_loader)
                    target_x, _, _ = next(target_iter)

                target_x = target_x.to(device)
                _, target_domain_pred, _ = model(target_x)

                source_domain_labels = torch.zeros(
                    batch_x.size(0), dtype=torch.long, device=device
                )
                target_domain_labels = torch.ones(
                    target_x.size(0), dtype=torch.long, device=device
                )

                domain_loss = (
                    nn.functional.cross_entropy(domain_pred, source_domain_labels)
                    + nn.functional.cross_entropy(target_domain_pred, target_domain_labels)
                )
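                # Out-of-place add: total_loss aliases recon_loss, so "+=" would
                # also change the reconstruction loss that is logged below
                total_loss = total_loss + 0.1 * domain_loss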
                total_domain_loss += domain_loss.item()

            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_recon_loss += recon_loss.item()

        scheduler.step()

        if (epoch + 1) % 10 == 0:
            avg_recon = total_recon_loss / len(source_loader)
            msg = f"Epoch {epoch+1}/{n_epochs} | Recon: {avg_recon:.6f}"
            if target_loader is not None:
                avg_domain = total_domain_loss / len(source_loader)
                msg += f" | Domain: {avg_domain:.4f}"
            print(msg)

    return model
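
The autoencoder produces a continuous score, so deployment still needs a decision threshold. One common choice, assumed here and not prescribed by the original text, is a high percentile of the scores on held-out data known to be normal, which directly fixes the expected false positive rate. The names `fit_threshold` and `target_fpr` are illustrative, and the gamma-distributed values stand in for the output of `model.anomaly_score`.

```python
import numpy as np


def fit_threshold(normal_scores, target_fpr=0.01):
    """Threshold such that about `target_fpr` of normal samples exceed it."""
    return float(np.quantile(np.asarray(normal_scores), 1.0 - target_fpr))


def flag_anomalies(scores, threshold):
    """Boolean mask of samples whose score is above the threshold."""
    return np.asarray(scores) > threshold


# Synthetic reconstruction errors standing in for real anomaly scores.
rng = np.random.default_rng(0)
normal_scores = rng.gamma(shape=2.0, scale=0.01, size=10_000)
threshold = fit_threshold(normal_scores, target_fpr=0.01)
observed_fpr = flag_anomalies(normal_scores, threshold).mean()
print(f"threshold={threshold:.4f}, FPR on normal data={observed_fpr:.4f}")
```

Because reconstruction error scales differ between brands, fitting one threshold per brand on that brand's normal data is likely to be safer than sharing a single value across the fleet.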

Evaluation Metrics

For production cobot anomaly detection, standard accuracy is uninformative. The class imbalance (often 99% normal and 1% anomaly) makes it trivial to obtain high accuracy by predicting "normal" in every case. The following metrics should be used instead:

  • AUROC (Area Under the ROC Curve). The primary metric. Measures the model's ability to rank anomalous samples above normal samples regardless of threshold. Aim for above 0.95.
  • F1 Score. The harmonic mean of precision and recall at the optimal threshold. Aim for above 0.85.
  • Precision@k. If the top-k most anomalous samples are flagged, the fraction that are true anomalies. This is important for maintenance teams that can investigate only a limited number of alerts per shift.
  • False Positive Rate (FPR). Perhaps the most important metric in production. Each false positive triggers an unnecessary investigation and erodes trust in the system. Target an FPR below 1% at the operating threshold.
Caution: When evaluating domain adaptation, performance should always be measured on the target domain separately. A model with 0.98 AUROC averaged across all brands may still have 0.85 AUROC on the newest brand, and that is the brand on which performance actually matters.
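
The metrics above can be computed with a few lines of NumPy. The following is a minimal sketch, not the evaluation code used elsewhere in this article: the function names, the toy scores, and the brand labels `brand_a` and `brand_b` are illustrative assumptions. It assumes higher scores mean "more anomalous" and that every evaluated group contains both normal and anomalous samples.

```python
import numpy as np


def auroc(y_true, scores):
    """AUROC via the rank-sum identity; tied scores receive average ranks."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return float((ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def best_f1(y_true, scores):
    """Return (best F1, threshold), flagging samples with score >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_score, best_threshold = 0.0, None
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & y_true)
        if tp == 0:
            continue
        precision = tp / pred.sum()
        recall = tp / y_true.sum()
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_score:
            best_score, best_threshold = float(f1), float(t)
    return best_score, best_threshold


def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring samples."""
    y_true = np.asarray(y_true).astype(bool)
    top_k = np.argsort(-np.asarray(scores, dtype=float))[:k]
    return float(y_true[top_k].mean())


def false_positive_rate(y_true, scores, threshold):
    """Fraction of normal samples flagged at the operating threshold."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(scores, dtype=float) >= threshold
    return float(np.sum(flagged & ~y_true) / np.sum(~y_true))


def per_domain_auroc(y_true, scores, domains):
    """AUROC computed separately for each domain (e.g. each cobot brand)."""
    y_true, scores, domains = map(np.asarray, (y_true, scores, domains))
    return {str(d): auroc(y_true[domains == d], scores[domains == d])
            for d in np.unique(domains)}


# Toy data: 1 = anomaly, 0 = normal. Values are made up for illustration.
labels = [0, 0, 0, 0, 1, 1]
anomaly_scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
brands = ["brand_a", "brand_a", "brand_b", "brand_b", "brand_b", "brand_a"]

pooled = auroc(labels, anomaly_scores)
by_brand = per_domain_auroc(labels, anomaly_scores, brands)
f1, threshold = best_f1(labels, anomaly_scores)
print(f"Pooled AUROC: {pooled:.3f} | per brand: {by_brand}")
print(f"Best F1: {f1:.3f} at threshold {threshold}")
```

On this toy data the pooled AUROC looks healthy while `brand_b` is no better than chance, which is exactly the failure mode the caution above describes. In practice, the threshold should be chosen on validation data and then held fixed when reporting FPR on the test set.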

Deployment Considerations

Edge versus cloud. Cobot anomaly detection often must run at the edge, directly on the robot controller or a nearby industrial PC. This constrains model size and inference latency. A CNN-based model with approximately 500K parameters can run inference in under 5 ms on an NVIDIA Jetson. The full CNN-LSTM AutoEncoder (around 2M parameters) requires roughly 20 ms. Transformer-based models typically need cloud deployment, which makes them unsuitable for real-time safety detection.

Inference latency requirements. For real-time safety-critical detection (such as collision avoidance), sub-10 ms inference is required. For predictive maintenance (detecting degradation patterns), latency of 100 ms to 1 s is acceptable, since trends are analysed over minutes or hours.
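
Latency figures like these should be verified on the actual target hardware. The sketch below shows one way to measure them, under stated assumptions: `measure_latency_ms` and `meets_budget` are hypothetical helper names, and the NumPy matrix product is only a stand-in for a real model's forward pass. For safety-critical paths the tail latency (p99) matters more than the mean, so the check defaults to p99.

```python
import time

import numpy as np

SAFETY_BUDGET_MS = 10.0         # sub-10 ms for real-time safety detection
MAINTENANCE_BUDGET_MS = 1000.0  # up to 1 s for predictive maintenance


def measure_latency_ms(infer_fn, n_warmup=10, n_runs=100):
    """Time repeated calls of infer_fn and return latency percentiles in ms."""
    for _ in range(n_warmup):  # warm-up runs are excluded from the statistics
        infer_fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        infer_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples = np.asarray(samples)
    return {
        "p50": float(np.percentile(samples, 50)),
        "p99": float(np.percentile(samples, 99)),
        "max": float(samples.max()),
    }


def meets_budget(stats, budget_ms, key="p99"):
    """True if the chosen latency statistic is within the budget."""
    return stats[key] <= budget_ms


# Stand-in workload (an assumption): replace the lambda with the real
# forward pass. For a PyTorch model on GPU, call torch.cuda.synchronize()
# inside infer_fn so that asynchronous kernels are included in the timing.
window = np.random.default_rng(0).standard_normal((64, 12))
weights = np.random.default_rng(1).standard_normal((12, 12))
stats = measure_latency_ms(lambda: np.tanh(window @ weights))

print(f"p50={stats['p50']:.3f} ms, p99={stats['p99']:.3f} ms")
print("Meets safety budget:", meets_budget(stats, SAFETY_BUDGET_MS))
```

Measurements taken on a development workstation do not transfer to a Jetson or an industrial PC, so the benchmark should be rerun on each deployment target.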

Model update strategy. Domain drift occurs: sensors degrade, firmware updates change data characteristics, and new operating conditions emerge. Plan for periodic recalibration of BN statistics (weekly) and full fine-tuning (monthly) to maintain performance. Use monitoring to trigger updates: if the anomaly score distribution shifts significantly on data known to be normal, the model requires recalibration.
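
The monitoring trigger described above can be sketched as a two-sample comparison between a reference set of anomaly scores (collected on known-normal data at deployment time) and a recent window of scores on data also known to be normal. This is one possible approach, not the only one: the two-sample Kolmogorov-Smirnov statistic, the function names, and the threshold of 0.2 are illustrative assumptions that would need tuning per fleet.

```python
import numpy as np


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(current, grid, side="right") / len(current)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


def needs_recalibration(reference, current, threshold=0.2):
    """Flag drift when the score distributions differ by more than threshold.

    The default threshold is an assumed placeholder, not a recommended value.
    """
    return ks_statistic(reference, current) > threshold


# Synthetic scores for illustration only.
rng = np.random.default_rng(42)
reference_scores = rng.normal(0.0, 1.0, size=2000)  # scores at deployment
stable_scores = rng.normal(0.0, 1.0, size=2000)     # same distribution
drifted_scores = rng.normal(1.0, 1.0, size=2000)    # mean shifted by one sigma

print("Stable window drifted: ", needs_recalibration(reference_scores, stable_scores))
print("Shifted window drifted:", needs_recalibration(reference_scores, drifted_scores))
```

A positive result would typically trigger the cheaper action first (recomputing BN statistics) and escalate to full fine-tuning only if the score distribution does not return to its reference shape.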

Putting It Together

Transfer learning is not a single technique but a paradigm that encompasses fine-tuning, domain adaptation, feature extraction, and related approaches such as multi-task learning and few-shot transfer. Understanding this hierarchy is the first step toward applying it effectively. Fine-tuning adapts a pre-trained model to new data through continued training. Domain adaptation bridges distribution gaps between source and target domains, even without target labels.

For heterogeneous cobot fleets, these techniques are not academic luxuries but operational necessities. The alternative is training separate models for every brand, every firmware version, and every operational context. That path produces an unmaintainable accumulation of models, each requiring its own labelled dataset.

The recommended practical pipeline begins simply: normalise sensor data across brands (Strategy 5) and fine-tune only the batch normalisation layers (Strategy 3). This baseline requires minimal labelled data and can be deployed within hours. If performance falls short, particularly on brands with unusual sensor characteristics, escalate to adversarial domain adaptation (Strategy 1 with DANN) or contrastive methods (Strategy 4). For organisations building long-term cobot intelligence platforms, investment in a foundation model (Strategy 6) yields compounding returns as the fleet grows.

The code examples throughout this article are complete and runnable. They are not production-ready: proper data loading, logging, checkpointing, and monitoring must be added. They do, however, provide the architectural foundation for any of the six strategies discussed. The most demanding aspect of cross-brand cobot anomaly detection is not the algorithm but the collection of representative data and the establishment of a labelling protocol that domain experts can follow consistently.

As collaborative robots become as common as industrial PCs on the factory floor, the ability to transfer anomaly detection across brands will distinguish organisations that scale their automation effectively from those that struggle with model maintenance. Transfer learning, fine-tuning, and domain adaptation are the tools that make such scaling possible.

References

  1. Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
  2. Ganin, Y., et al. (2016). Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17(59), 1-35.
  3. Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. ECCV Workshops.
  4. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL 2018.
  5. Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
  6. Ansari, A. F., et al. (2024). Chronos: Learning the Language of Time Series. arXiv preprint arXiv:2403.07815.
  7. Long, M., et al. (2015). Learning Transferable Features with Deep Adaptation Networks. ICML 2015.
  8. Tzeng, E., et al. (2017). Adversarial Discriminative Domain Adaptation. CVPR 2017.
  9. Khosla, P., et al. (2020). Supervised Contrastive Learning. NeurIPS 2020.
  10. Li, Y., et al. (2017). Revisiting Batch Normalization For Practical Domain Adaptation. ICLR Workshop 2017.
  11. Zhao, H., et al. (2018). Adversarial Multiple Source Domain Adaptation. NeurIPS 2018.
  12. Courty, N., et al. (2017). Optimal Transport for Domain Adaptation. IEEE TPAMI, 39(9), 1853-1865.
  13. Das, A., et al. (2024). A Decoder-Only Foundation Model for Time-Series Forecasting (TimesFM). arXiv preprint arXiv:2310.10688.
  14. ISO/TS 15066:2016. Robots and robotic devices—Collaborative robots. International Organization for Standardization.

Disclaimer: This article is provided for informational and educational purposes only. Code examples are provided as-is and should be thoroughly tested and validated before use in production environments, particularly in safety-critical robotics applications. Practitioners should follow their organisation's safety protocols and applicable ISO standards when deploying anomaly detection systems on collaborative robots.
