Summary
What this post covers: A clear separation of transfer learning, fine-tuning, and domain adaptation as a hierarchy of techniques, applied to the concrete problem of building a cross-brand anomaly detection model for heterogeneous collaborative robot fleets with runnable PyTorch examples.
Key insights:
- Transfer learning is the umbrella paradigm; fine-tuning, domain adaptation, feature extraction, multi-task learning, and few-shot transfer are sibling techniques within it, not synonyms. Getting this hierarchy right prevents most conceptual errors.
- For heterogeneous cobot fleets, the cheapest effective starting point is per-channel sensor normalization plus fine-tuning only the batch normalization layers. This requires almost no target labels and can be deployed in hours.
- When BN-only adaptation falls short, escalate to adversarial domain adaptation (DANN) or supervised contrastive methods, which align source and target feature distributions even without target labels.
- Inference latency requirements drive architecture choice: a 500K-parameter CNN can run in under 5 ms on Jetson hardware, which is fast enough for collision avoidance, while transformer-based models typically require cloud deployment, whose latency is unsuitable for real-time safety detection.
- The hardest part of cross-brand cobot anomaly detection is not the algorithm but data collection and a consistent labeling protocol that domain experts can apply across brands, firmware versions, and operating conditions.
Main topics: Transfer Learning: The Big Picture; Fine-Tuning: Techniques and Strategies; Domain Adaptation: Bridging the Distribution Gap; The Cobot Anomaly Detection Scenario; Practical Implementation Guide; Putting It Together; References.
Consider a Universal Robots UR5e and a FANUC CRX-10iA on the same production line, performing identical pick-and-place operations. Both have six joints, both lift the same payload, and both generate streams of torque, position, and velocity data every millisecond. Yet when an anomaly detection model trained on the UR5e’s data is deployed on the FANUC, the model flags nearly everything as anomalous, even though the task is identical. The sensor noise profiles differ, the control loop frequencies do not match, and the calibration offsets shift the data, so the two robots produce entirely different data distributions. The model understands what “normal” looks like for one robot, but is effectively blind to normalcy on another.
This is not a hypothetical problem. As collaborative robots (cobots) proliferate across manufacturing, logistics, and healthcare, organisations increasingly operate heterogeneous fleets that span multiple brands, generations, and firmware versions. Training a separate anomaly detection model for every brand is expensive, slow, and inefficient. The question is whether a model can transfer its understanding of normal robot behaviour across brands.
This is precisely the problem that transfer learning, fine-tuning, and domain adaptation were designed to address. The following sections examine these three concepts, clarify how they relate to one another, and apply them to a concrete scenario: building a cross-brand anomaly detection system for heterogeneous cobots. The treatment provides both theoretical understanding and complete, runnable PyTorch code for several adaptation strategies.
Before proceeding, the conceptual hierarchy that frames the discussion should be made explicit:
Transfer Learning (broad paradigm)
├── Fine-Tuning (retrain pre-trained model on new data)
├── Domain Adaptation (bridge distribution gap between domains)
│   ├── Supervised Domain Adaptation
│   ├── Unsupervised Domain Adaptation (UDA)
│   └── Semi-Supervised Domain Adaptation
├── Feature Extraction (freeze pre-trained layers, train new head)
├── Multi-Task Learning (shared representations)
└── Zero-Shot / Few-Shot Transfer
Transfer learning is the overarching idea: take knowledge learned in one context and apply it in another. Fine-tuning is one mechanism for doing so, in which a pre-trained model is further trained on the target data. Domain adaptation is another mechanism, which specifically addresses the situation in which source and target data come from different distributions. Feature extraction, multi-task learning, and zero- or few-shot transfer are additional strategies under the same umbrella. They are sibling strategies, not synonyms.
With that framework established, each technique is examined in detail below.
Transfer Learning: The Big Picture
Formal Definition
Transfer learning is the paradigm of using knowledge acquired from a source task or domain to improve learning on a target task or domain. Formally, given a source domain D_S with a learning task T_S, and a target domain D_T with a learning task T_T, transfer learning aims to improve the learning of the target predictive function f_T(·) using knowledge from D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.
Expressed informally: resources have already been spent learning something useful in one context. The objective is to reuse that learning rather than start from scratch.
Why Transfer Learning Matters
The motivation is overwhelmingly practical:
- Limited labelled data. Labelling anomalies in cobot sensor data requires domain experts familiar with both the robot’s kinematics and the manufacturing process. Thousands of labelled samples may be available for one robot brand, but very few for another.
- Expensive annotation. Each labelled anomaly may require a robotics engineer to review hours of sensor logs. At 150 USD per hour, labelling 10,000 samples across five brands can cost more than the robots themselves.
- Faster convergence. A model initialised with transferred knowledge reaches acceptable performance in hours rather than weeks.
- Better generalisation. Features learned from large, diverse datasets often capture general patterns that improve performance even on seemingly unrelated tasks.
Types of Transfer Learning
The taxonomy breaks down based on what differs between source and target:
| Type | Source Labels | Target Labels | Relationship | Example |
|---|---|---|---|---|
| Inductive Transfer | Available | Available | T_S ≠ T_T | ImageNet classification → medical image segmentation |
| Transductive Transfer | Available | Not available | D_S ≠ D_T, T_S = T_T | UR5e anomaly detection → FANUC anomaly detection (no FANUC labels) |
| Unsupervised Transfer | Not available | Not available | D_S ≠ D_T | Self-supervised pre-training on all cobot data → clustering |
For the cobot scenario, transductive transfer is the most relevant setting: labelled anomaly data exists for one or a few brands (the source domains), and the same anomaly detection task must be performed on new brands (the target domains) for which labels are scarce or nonexistent.
When Transfer Learning Works, and When It Fails
Transfer learning is not a universal solution. It works when source and target share underlying structure. A model trained on ImageNet transfers well to medical imaging because both involve recognising edges, textures, and shapes. A model trained on English text transfers well to French because the two languages share grammatical abstractions.
It fails, sometimes substantially, when source and target are too dissimilar. This is termed negative transfer: the transferred knowledge actively degrades performance on the target task. For example, a model trained on satellite imagery may transfer poorly to microscopy images despite both being images. The spatial scales, textures, and semantic content differ fundamentally.
In the cobot scenario, transfer learning is promising because the robots share the same fundamental kinematic structure. A six-axis articulated arm generates torque profiles that follow similar physical laws regardless of brand. The differences arise in sensor calibration, noise characteristics, and control-system specifics—exactly the kind of distribution shift that domain adaptation was designed to handle.
Historical Context
The modern era of transfer learning began with ImageNet. In 2012, AlexNet demonstrated that deep CNNs could learn powerful visual features. By 2014, researchers had observed that these features, especially those from early layers, transferred remarkably well to other vision tasks. “ImageNet pre-training” became the default starting point for nearly every computer vision project.
NLP followed a similar trajectory. Word2Vec and GloVe provided transferable word embeddings, but the broader transformation came with BERT (2018) and GPT (2018–2019), which showed that pre-training on substantial text corpora created representations that transferred to nearly any language task. Today’s large language models are perhaps the most extensive transfer learning systems: pre-trained on trillions of tokens, then fine-tuned or prompted for specific tasks.
Time-series and industrial AI are now undergoing their own transfer learning shift. Models such as Chronos, TimesFM, and Lag-Llama are emerging as foundation models for temporal data, and domain adaptation for sensor data is an active research area with direct industrial application.
Training From Scratch vs. Transfer Learning
| Factor | From Scratch | Transfer Learning |
|---|---|---|
| Labeled data needed | Large (10k–1M+ samples) | Small (100–1k samples) |
| Training time | Days to weeks | Hours to days |
| Compute cost | High (multi-GPU) | Low to moderate (single GPU) |
| Performance (limited data) | Poor (overfits) | Good to excellent |
| Performance (abundant data) | Excellent (eventually) | Excellent (faster) |
| Domain expertise needed | High (architecture design) | Moderate (strategy selection) |
| Risk of negative transfer | None | Possible if domains too different |
Fine-Tuning: Techniques and Strategies
Fine-tuning is the most widely used transfer learning technique: take a model pre-trained on a source task or domain and continue training it on the target data. The concept is simple, but the practice is nuanced.
Full Fine-Tuning and Partial Fine-Tuning
Full fine-tuning updates all parameters of the pre-trained model. This affords maximum flexibility to adapt, but also presents the highest risk of overfitting, particularly when the target dataset is small. With 50,000 labelled samples in the target domain, full fine-tuning is generally safe. With 500, it is risky.
Partial fine-tuning freezes some layers (typically the earlier ones) and updates only the remainder. The reasoning is that early layers learn generic, transferable features (edge detectors in vision, basic temporal patterns in time-series), while later layers learn task-specific features. Freezing early layers preserves the generic knowledge while adapting the task-specific parts.
Layer-Wise Learning Rate Decay (Discriminative Fine-Tuning)
Rather than imposing a binary freeze/unfreeze decision, discriminative fine-tuning assigns different learning rates to different layers. Earlier layers receive smaller learning rates (they change slowly), while later layers receive larger learning rates (they require more adaptation). A common approach multiplies the learning rate by a decay factor for each layer moving backwards from the output:
# Discriminative learning rates in PyTorch
import torch

def get_discriminative_params(model, base_lr=1e-3, decay_factor=0.9):
    """Assign decreasing learning rates to earlier layers."""
    params = []
    # Note: each parameter tensor (weight, bias) is treated as its own "layer",
    # so the decay is applied once per tensor, not once per module.
    layers = list(model.named_parameters())
    n_layers = len(layers)
    for i, (name, param) in enumerate(layers):
        # Earlier layers get smaller LR
        layer_lr = base_lr * (decay_factor ** (n_layers - i - 1))
        params.append({
            'params': param,
            'lr': layer_lr,
            'name': name
        })
    return params

# Usage (model is an existing pre-trained nn.Module)
param_groups = get_discriminative_params(model, base_lr=1e-3, decay_factor=0.85)
optimizer = torch.optim.AdamW(param_groups)
Gradual Unfreezing
Gradual unfreezing begins by training only the final layer (or layers), then progressively unfreezes earlier layers as training proceeds. This prevents early layers from being corrupted by the large gradients that occur at the start of fine-tuning when the loss is high. The strategy was popularised by ULMFiT (Universal Language Model Fine-tuning) and works well for both NLP and time-series tasks.
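The schedule described above can be sketched in a few lines. This is a minimal illustration, not the ULMFiT implementation: the helper names `set_trainable` and `gradual_unfreeze`, the three-block toy model, and the choice of unfreezing one block every two epochs are all assumptions made for the example.

```python
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradients for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def gradual_unfreeze(blocks, epoch, epochs_per_stage=2):
    """Unfreeze blocks from the output backwards, one more per stage.

    `blocks` is ordered from input to output. Returns how many blocks
    are trainable at the given epoch.
    """
    n_unfrozen = min(len(blocks), 1 + epoch // epochs_per_stage)
    for i, block in enumerate(blocks):
        set_trainable(block, i >= len(blocks) - n_unfrozen)
    return n_unfrozen


# Hypothetical three-block model: two conv blocks and a linear head.
blocks = [
    nn.Sequential(nn.Conv1d(24, 32, kernel_size=7, padding=3), nn.ReLU()),
    nn.Sequential(nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU()),
    nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 2)),
]
model = nn.Sequential(*blocks)

# Number of trainable blocks for each of the first six epochs.
schedule = [gradual_unfreeze(blocks, epoch) for epoch in range(6)]
```

In a training loop, `gradual_unfreeze(blocks, epoch)` would be called at the start of each epoch. An optimizer built over all parameters can be reused across stages, because PyTorch optimizers skip parameters that have no gradient.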
The Fine-Tuning Decision Matrix
The appropriate fine-tuning strategy depends on two factors: the amount of available target data and the similarity between source and target domains.
| Scenario | Target Data Size | Domain Similarity | Recommended Strategy |
|---|---|---|---|
| A | Small (<1k) | High | Feature extraction only (freeze all, train classifier head) |
| B | Small (<1k) | Low | Fine-tune final layers with aggressive regularization |
| C | Large (>10k) | High | Full fine-tuning with small learning rate |
| D | Large (>10k) | Low | Full fine-tuning or train from scratch |
For cobots that share kinematic structure but differ in brand, the situation falls firmly in the high domain similarity column. When labelled data for the target brand is limited (a common case), Scenario A applies, calling for feature extraction or minimal fine-tuning. When substantial data is available, Scenario C applies, with gentle full fine-tuning.
Regularisation During Fine-Tuning
Fine-tuning on small datasets risks catastrophic forgetting, in which the model loses what it learned during pre-training. Several regularisation techniques help mitigate this risk:
- L2-SP (L2 penalty toward starting point). Instead of penalising weights toward zero, penalise them toward their pre-trained values. This keeps the model close to the pre-trained solution while allowing adaptation.
- Dropout. Especially effective when applied to the layers being fine-tuned. Typical values are 0.1 to 0.3 during fine-tuning, compared with 0.5 during training from scratch.
- Early stopping. Monitor validation loss on the target domain and halt training when it begins to increase. With small target datasets, overfitting can occur within a few epochs.
- Weight decay. Standard L2 regularisation remains effective, typically at 0.01 to 0.1 during fine-tuning.
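Of the techniques listed, L2-SP is the least familiar, so a short sketch may help. The function name `l2_sp_penalty`, the `strength` value, and the toy linear model are assumptions for illustration. The only essential idea is that the penalty is measured from a snapshot of the pre-trained weights, not from zero.

```python
import torch
import torch.nn as nn


def l2_sp_penalty(model, reference_state, strength=0.01):
    """Half the squared distance between current and pre-trained weights."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + ((param - reference_state[name]) ** 2).sum()
    return 0.5 * strength * penalty


# Toy stand-in for a pre-trained model (hypothetical).
model = nn.Linear(4, 2)

# Snapshot the pre-trained weights once, before fine-tuning starts.
reference = {n: p.detach().clone() for n, p in model.named_parameters()}
penalty_before = l2_sp_penalty(model, reference)

# Simulate drift during fine-tuning: shift all 8 weights by 1.0.
with torch.no_grad():
    model.weight.add_(1.0)
penalty_after = l2_sp_penalty(model, reference, strength=0.01)
```

During fine-tuning the penalty would simply be added to the task loss, for example `loss = criterion(output, target) + l2_sp_penalty(model, reference)`.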
Modern Parameter-Efficient Fine-Tuning
Full fine-tuning updates millions or billions of parameters, which is computationally expensive and requires storing a full copy of the model per task. Parameter-efficient fine-tuning (PEFT) methods address this constraint by updating only a small subset of parameters:
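- LoRA (Low-Rank Adaptation). Adds trainable low-rank matrices alongside selected weight matrices (typically the attention projections in transformers). Rather than updating a weight matrix W directly, LoRA learns the update as ΔW = BA, where B and A are low-rank matrices and W itself stays frozen. For very large models this can cut trainable parameters by up to roughly 10,000 times (the figure reported for GPT-3 175B) while largely preserving performance.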
- QLoRA. Combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of large models on a single consumer GPU.
- Adapters. Small bottleneck modules inserted between existing layers. Only adapter parameters are trained; the remainder remains frozen.
- Prefix Tuning and Prompt Tuning. Prepend learnable vectors to the input or hidden states. These approaches originated in NLP but are conceptually applicable to any sequence model.
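To make the LoRA update ΔW = BA concrete, the sketch below wraps a single `nn.Linear` layer. It is a hand-rolled illustration and not the API of any PEFT library; the class name `LoRALinear` and the `rank` and `alpha` values are assumptions. B is initialised to zero so that the wrapped layer initially reproduces the frozen base layer exactly.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # pre-trained weights stay fixed
        # A: (rank, in_features), B: (out_features, rank); delta_W = B @ A
        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scale = alpha / rank

    def forward(self, x):
        # x @ A^T @ B^T equals x @ (B @ A)^T without forming delta_W
        delta = x @ self.lora_a.t() @ self.lora_b.t()
        return self.base(x) + self.scale * delta


torch.manual_seed(0)
base = nn.Linear(128, 64)
layer = LoRALinear(base, rank=4)
x = torch.randn(3, 128)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
```

Here the trainable count is rank × (in_features + out_features) = 768 against 8,256 frozen parameters. The saving is modest for a layer this small and grows with layer width.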
Fine-Tuning Code Example
The following is a complete example of fine-tuning a PyTorch model with layer freezing and discriminative learning rates for a time-series anomaly detection task:
import torch
import torch.nn as nn


class CobotAnomalyModel(nn.Module):
    """1D-CNN feature extractor + classifier for cobot anomaly detection."""

    def __init__(self, n_joints=6, n_features_per_joint=4, seq_len=200):
        super().__init__()
        in_channels = n_joints * n_features_per_joint  # 24 input channels
        # Feature extractor (transferable layers)
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1)
        )
        # Classifier head (task-specific)
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 2)  # normal vs anomaly
        )

    def forward(self, x):
        # x shape: (batch, channels, seq_len)
        feat = self.features(x).squeeze(-1)
        return self.classifier(feat)


def fine_tune_for_new_brand(
    pretrained_model,
    target_loader,
    val_loader,
    freeze_features=True,
    base_lr=1e-3,
    n_epochs=30
):
    """Fine-tune a pre-trained cobot model for a new brand."""
    model = pretrained_model
    if freeze_features:
        # Strategy A: freeze feature extractor, train only classifier
        for param in model.features.parameters():
            param.requires_grad = False
        optimizer = torch.optim.Adam(
            model.classifier.parameters(), lr=base_lr
        )
    else:
        # Strategy C: discriminative learning rates
        param_groups = [
            {'params': model.features.parameters(), 'lr': base_lr * 0.1},
            {'params': model.classifier.parameters(), 'lr': base_lr},
        ]
        optimizer = torch.optim.Adam(param_groups)

    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')
    patience_counter = 0

    for epoch in range(n_epochs):
        # Note: model.train() still lets BatchNorm update its running
        # statistics on target data, even when the feature weights are frozen.
        model.train()
        for batch_x, batch_y in target_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

        # Validation and early stopping
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                output = model(batch_x)
                val_loss += criterion(output, batch_y).item()
        val_loss /= len(val_loader)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            patience_counter += 1
            if patience_counter >= 5:
                print(f"Early stopping at epoch {epoch}")
                break

    # Restore the checkpoint with the lowest validation loss
    model.load_state_dict(torch.load('best_model.pt'))
    return model
Domain Adaptation: Bridging the Distribution Gap
Whereas fine-tuning assumes that at least some labelled data is available in the target domain, domain adaptation is most often applied to a harder problem: substantial labelled data in the source domain, but no labels at all in the target domain. This setting is unsupervised domain adaptation (UDA), the most common and challenging scenario in real-world deployments.
Formal Definition
In domain adaptation, source and target domains share the same task (for example, anomaly detection) but have different data distributions. Formally: P_S(X) ≠ P_T(X), while the labelling function is identical. The objective is to learn a model that performs well on the target distribution despite being trained primarily on the source distribution.
Several types of distribution shift can occur:
- Covariate shift. P(X) changes while P(Y|X) remains constant. The input distributions differ but the relationship between inputs and outputs is preserved. This is the most common scenario for cobots: sensor data distributions differ across brands, while the definition of “anomaly” remains consistent.
- Label shift. P(Y) changes while P(X|Y) remains constant. The prior probability of classes changes. For example, one brand may have a 2% anomaly rate while another has 5%.
- Concept drift. P(Y|X) changes—the same input has different meanings in different domains. This is rare for same-structure cobots but can arise when different brands define “normal operating range” differently.
Key Unsupervised Domain Adaptation Methods
Discrepancy-Based Methods
These methods explicitly measure and minimise the distance between source and target feature distributions.
Maximum Mean Discrepancy (MMD) measures the distance between two distributions by comparing their mean embeddings in a reproducing kernel Hilbert space (RKHS). If the mean embeddings are identical, the distributions are identical (for characteristic kernels). In practice, an MMD penalty is added to the training loss to encourage the network to produce similar feature distributions for source and target data.
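A minimal sketch of an MMD penalty follows, assuming a single Gaussian (RBF) kernel with a hand-picked bandwidth and the biased estimator. Practical implementations often sum several bandwidths or choose one with a median heuristic; the function name `rbf_mmd2` and the synthetic feature tensors are assumptions for illustration.

```python
import torch


def rbf_mmd2(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two (n, d) feature batches."""

    def kernel(a, b):
        sq_dist = torch.cdist(a, b) ** 2
        return torch.exp(-sq_dist / (2.0 * bandwidth ** 2))

    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2.0 * kernel(source, target).mean())


torch.manual_seed(0)
src = torch.randn(64, 16)                # source features
tgt_same = torch.randn(64, 16)           # same distribution as source
tgt_shift = torch.randn(64, 16) + 2.0    # mean-shifted distribution

mmd_same = rbf_mmd2(src, tgt_same, bandwidth=4.0)
mmd_shift = rbf_mmd2(src, tgt_shift, bandwidth=4.0)
```

In training, the penalty would be computed on encoder outputs for a source batch and a target batch and added to the task loss with a weighting factor, which is itself a hyperparameter to tune.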
CORAL (CORrelation ALignment) aligns the second-order statistics (covariance matrices) of source and target features. Deep CORAL integrates this alignment into the network by adding a CORAL loss at one or more hidden layers. The CORAL loss is the squared Frobenius norm of the difference between source and target covariance matrices, scaled by 1/(4d²), where d is the feature dimension.
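The CORAL loss is short enough to write out directly. The sketch below follows the Deep CORAL formulation, in which the squared Frobenius norm of the covariance difference is divided by 4d² for feature dimension d; the function name `coral_loss` and the synthetic inputs are assumptions for illustration.

```python
import torch


def coral_loss(source, target):
    """CORAL loss between two (n, d) feature batches."""
    d = source.size(1)
    # torch.cov expects variables in rows, observations in columns
    cov_s = torch.cov(source.t())
    cov_t = torch.cov(target.t())
    return ((cov_s - cov_t) ** 2).sum() / (4.0 * d * d)


torch.manual_seed(0)
feats = torch.randn(256, 8)

loss_same = coral_loss(feats, feats)          # identical covariances
loss_scaled = coral_loss(feats, 3.0 * feats)  # target covariance is 9x larger
```

Because only covariances are compared, a pure mean offset between brands leaves this loss unchanged. That is one reason it is usually combined with input normalisation.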
Adversarial-Based Methods
These methods use an adversarial framework to learn domain-invariant features—features that are useful for the task but cannot be used by a discriminator to distinguish between source and target domains.
Domain-Adversarial Neural Networks (DANN) represent the principal approach. The architecture has three components: a shared feature extractor, a task classifier (for anomaly detection), and a domain discriminator. The key element is the gradient reversal layer (GRL): during backpropagation, gradients from the domain discriminator are reversed before reaching the feature extractor. The feature extractor is thus trained to maximise the domain discriminator’s loss—that is, to produce features that confuse the discriminator about which domain the data came from.
ADDA (Adversarial Discriminative Domain Adaptation) uses separate feature extractors for source and target, with the target extractor initialised from the source. The adversarial dynamic operates between the target encoder and the discriminator.
CyCADA (Cycle-Consistent Adversarial Domain Adaptation) combines pixel-level adaptation (using CycleGAN-style image translation) with feature-level adaptation. Although primarily used for visual tasks, the concept of cycle-consistent adaptation extends to other modalities.
Self-Training and Pseudo-Labelling
Self-training is conceptually simple but often effective: train on labelled source data, generate predictions (pseudo-labels) on unlabelled target data, and retrain on the combined dataset. The principal challenges are noise in the pseudo-labels and confirmation bias. Modern approaches use confidence thresholding (retaining only high-confidence pseudo-labels) and curriculum learning (beginning with the most confident predictions and gradually including less confident ones).
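The confidence-thresholding step can be sketched as follows. The function name `select_pseudo_labels`, the 0.9 threshold, and the toy linear model standing in for a source-trained network are assumptions; only the selection logic is the point.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def select_pseudo_labels(model, target_x, threshold=0.9):
    """Keep only target samples whose top softmax probability meets the threshold."""
    model.eval()
    probs = torch.softmax(model(target_x), dim=1)
    confidence, labels = probs.max(dim=1)
    keep = confidence >= threshold
    return target_x[keep], labels[keep], confidence[keep]


torch.manual_seed(0)
# Hypothetical stand-in for a model already trained on source data.
model = nn.Sequential(nn.Flatten(), nn.Linear(24 * 50, 2))
target_x = torch.randn(32, 24, 50)  # unlabelled target windows

x_keep, y_keep, conf = select_pseudo_labels(model, target_x, threshold=0.9)
```

A curriculum variant would start with a high threshold and lower it over successive rounds, retraining on the source data plus the retained pseudo-labelled samples each time.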
Optimal Transport Methods
Optimal transport provides a mathematically principled means of measuring and minimising the distance between distributions using the Wasserstein distance. It identifies the minimum cost of transforming one distribution into another and can be used to explicitly map source features to target features.
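Full multivariate optimal transport needs a dedicated solver, but the one-dimensional case has a closed form that conveys the idea: for two equally sized samples, the Wasserstein-1 distance is the mean absolute difference between the sorted values. The sketch below applies this independently to each feature. It is therefore a per-feature marginal distance, not a full transport plan, and the function name `wasserstein_1d` is an assumption.

```python
import torch


def wasserstein_1d(source, target):
    """Per-feature 1-D Wasserstein-1 distance for two (n, d) samples."""
    if source.shape != target.shape:
        raise ValueError("this sketch assumes equal sample counts and dims")
    src_sorted, _ = torch.sort(source, dim=0)
    tgt_sorted, _ = torch.sort(target, dim=0)
    return (src_sorted - tgt_sorted).abs().mean(dim=0)


torch.manual_seed(0)
src = torch.randn(500, 4)
tgt = src + 0.5  # every feature shifted by a constant offset

distances = wasserstein_1d(src, tgt)
```

For a constant offset the distance equals the offset itself, which makes this a convenient diagnostic for calibration differences between two brands before any model is trained.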
Advanced Domain Adaptation Scenarios
The standard UDA setup assumes one source and one target domain. Real-world scenarios are often more complex:
- Multi-source domain adaptation. Labelled data is available from multiple source domains (for example, three cobot brands), and the objective is to adapt to a new target brand. Methods such as MDAN (Multi-source Domain Adversarial Networks) and M3SDA handle this by learning domain-specific and domain-shared features simultaneously.
- Partial domain adaptation. The target domain contains fewer classes than the source. For example, the source model detects 10 types of anomalies, but the target brand exhibits only six of them. Standard UDA methods can perform poorly because they attempt to align classes that do not exist in the target.
- Open-set domain adaptation. The target domain contains classes not seen in the source. This is realistic for cobots: a new brand may exhibit failure modes absent from the training data. Methods must both adapt known classes and detect unknown target-specific anomalies.
Method Comparison
| Method | Mechanism | Best When | Complexity | Performance |
|---|---|---|---|---|
| MMD | Match kernel mean embeddings | Small domain gap, clean data | Low | Good baseline |
| CORAL | Align covariance matrices | Linear shifts between domains | Low | Good for simple shifts |
| DANN | Adversarial domain confusion | Complex nonlinear shifts | Medium | Strong across scenarios |
| Self-Training | Pseudo-label target data | High-confidence predictions available | Low | Variable (depends on pseudo-label quality) |
| Optimal Transport | Wasserstein distance minimization | Strong theoretical guarantees needed | High | Strong but computationally expensive |
DANN Implementation with Gradient Reversal Layer
The following is a complete PyTorch implementation of a Domain-Adversarial Neural Network:
import math

import torch
import torch.nn as nn
from torch.autograd import Function


class GradientReversalFunction(Function):
    """Gradient Reversal Layer (GRL).

    Forward pass: identity function.
    Backward pass: negate gradients and scale by lambda.
    """

    @staticmethod
    def forward(ctx, x, lambda_val):
        ctx.lambda_val = lambda_val
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is returned for lambda_val
        return -ctx.lambda_val * grad_output, None


class GradientReversalLayer(nn.Module):
    def __init__(self, lambda_val=1.0):
        super().__init__()
        self.lambda_val = lambda_val

    def forward(self, x):
        return GradientReversalFunction.apply(x, self.lambda_val)


class DANN(nn.Module):
    """Domain-Adversarial Neural Network for time-series data."""

    def __init__(self, n_input_channels=24, n_classes=2, n_domains=2):
        super().__init__()
        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # Global average pooling
        )
        # Task classifier (anomaly detection)
        self.task_classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )
        # Domain discriminator
        self.domain_discriminator = nn.Sequential(
            GradientReversalLayer(lambda_val=1.0),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_domains),
        )

    def forward(self, x):
        features = self.feature_extractor(x).squeeze(-1)
        task_output = self.task_classifier(features)
        domain_output = self.domain_discriminator(features)
        return task_output, domain_output

    def set_lambda(self, lambda_val):
        """Update GRL lambda (schedule during training)."""
        for module in self.domain_discriminator.modules():
            if isinstance(module, GradientReversalLayer):
                module.lambda_val = lambda_val


def train_dann(model, source_loader, target_loader, n_epochs=50, device='cpu'):
    """Train DANN with progressive lambda scheduling.

    Both loaders are expected to yield (x, y) pairs; target labels are ignored.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    task_criterion = nn.CrossEntropyLoss()
    domain_criterion = nn.CrossEntropyLoss()
    model.to(device)

    for epoch in range(n_epochs):
        model.train()
        # Progressive lambda: 0 -> 1 over training
        p = epoch / n_epochs
        lambda_val = 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0
        model.set_lambda(lambda_val)

        # Iterate over both loaders simultaneously,
        # restarting the target loader when it runs out
        target_iter = iter(target_loader)
        for source_x, source_y in source_loader:
            try:
                target_x, _ = next(target_iter)
            except StopIteration:
                target_iter = iter(target_loader)
                target_x, _ = next(target_iter)

            source_x = source_x.to(device)
            source_y = source_y.to(device)
            target_x = target_x.to(device)

            # Source domain: label = 0
            source_task_out, source_domain_out = model(source_x)
            source_domain_labels = torch.zeros(
                source_x.size(0), dtype=torch.long, device=device
            )
            # Target domain: label = 1 (no task labels!)
            _, target_domain_out = model(target_x)
            target_domain_labels = torch.ones(
                target_x.size(0), dtype=torch.long, device=device
            )

            # Combined loss
            task_loss = task_criterion(source_task_out, source_y)
            domain_loss = (
                domain_criterion(source_domain_out, source_domain_labels)
                + domain_criterion(target_domain_out, target_domain_labels)
            )
            total_loss = task_loss + domain_loss

            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

        # Report the losses of the last batch every 10 epochs
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{n_epochs} | "
                  f"Task Loss: {task_loss.item():.4f} | "
                  f"Domain Loss: {domain_loss.item():.4f} | "
                  f"Lambda: {lambda_val:.4f}")
The Cobot Anomaly Detection Scenario
Consider applying the foregoing material to a concrete, industrially relevant problem. A factory operates multiple collaborative robots from different manufacturers: Universal Robots UR5e, FANUC CRX-10iA, ABB GoFa, KUKA LBR iiwa, and Doosan M1013. All are six- or seven-axis articulated arms performing similar tasks, and all generate sensor data: joint torques, positions, velocities, and motor currents.
The objective is one anomaly detection system that works across all brands, or, at minimum, a system that can be quickly adapted to a new brand without collecting thousands of labelled anomaly examples.
The challenge is that, despite a shared kinematic structure, each brand has fundamentally different data distributions, owing to:
- Sensor characteristics. Different torque sensor resolutions, noise floors, and sampling rates (125 Hz, 500 Hz, or 1 kHz).
- Control systems. Different PID gains, trajectory planning algorithms, and jerk limits.
- Calibration. Different zero-point offsets, gear ratio tolerances, and friction models.
- Firmware. Different interpolation methods, filtering strategies, and data encoding.
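Two of the differences listed above, sampling rate and calibration offset, can be reduced before any learning takes place. The sketch below resamples a batch to a common rate with linear interpolation and then standardises each channel. The function name `resample_and_normalise`, the 125 Hz target rate, and the synthetic input are assumptions. Plain linear interpolation applies no anti-aliasing filter, so a production pipeline would likely low-pass filter before downsampling.

```python
import torch
import torch.nn.functional as F


def resample_and_normalise(x, source_hz, target_hz=125.0, eps=1e-6):
    """Resample (batch, channels, seq_len) data and z-score each channel."""
    new_len = max(1, int(round(x.size(-1) * target_hz / source_hz)))
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=False)
    # Per-channel statistics over the batch and time dimensions
    mean = x.mean(dim=(0, 2), keepdim=True)
    std = x.std(dim=(0, 2), keepdim=True)
    return (x - mean) / (std + eps)


torch.manual_seed(0)
# Synthetic 500 Hz recording with a calibration offset and larger scale.
raw = torch.randn(8, 24, 200) * 5.0 + 3.0

out = resample_and_normalise(raw, source_hz=500.0)
```

In deployment the mean and standard deviation would be estimated once per brand from normal-operation data and stored, so that single windows can be normalised at inference time without batch statistics.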
Six strategies are now examined, ranging from simple preprocessing to sophisticated neural domain adaptation.
Strategy 1: Domain-Invariant Feature Learning with DANN
This is the most principled approach. Using the DANN architecture from the previous section, the practitioner trains on labelled data from one brand (for example, the UR5e, the most common cobot with the most available data) and uses unlabelled data from other brands during training. The gradient reversal layer requires the feature extractor to learn representations that capture anomaly-relevant patterns while remaining invariant to brand-specific sensor characteristics.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np

# GradientReversalLayer is the class defined in the DANN implementation above.


class CobotSensorDataset(Dataset):
    """Dataset for multi-joint cobot sensor data.

    Each sample: (n_joints * n_features, seq_len) tensor
    Features per joint: torque, position, velocity, current
    """

    def __init__(self, data, labels, domain_id):
        self.data = torch.FloatTensor(data)      # (N, channels, seq_len)
        self.labels = torch.LongTensor(labels)   # (N,) - 0=normal, 1=anomaly
        self.domain_id = domain_id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx], self.domain_id


class CobotDANN(nn.Module):
    """DANN specifically designed for cobot anomaly detection.

    Input: multi-joint sensor data (6 joints x 4 features = 24 channels)
    Task: binary anomaly detection
    Domain: cobot brand identification (adversarial)
    """

    def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
        super().__init__()
        in_ch = n_joints * features_per_joint
        self.encoder = nn.Sequential(
            # Block 1: capture local temporal patterns
            nn.Conv1d(in_ch, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            # Block 2: capture mid-range dependencies
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
            # Block 3: high-level features
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.anomaly_head = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )
        self.domain_head = nn.Sequential(
            GradientReversalLayer(lambda_val=1.0),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_brands),
        )

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        anomaly_pred = self.anomaly_head(features)
        domain_pred = self.domain_head(features)
        return anomaly_pred, domain_pred, features

    def predict_anomaly(self, x):
        """Inference: only anomaly prediction needed."""
        features = self.encoder(x).squeeze(-1)
        return self.anomaly_head(features)
Strategy 2: Multi-Source Domain Adaptation
When data from multiple brands is available, all sources can be used simultaneously. The key idea is to use domain-specific batch normalisation: each brand receives its own BN layer to handle its distinctive distribution statistics, while all other weights remain shared. This captures the intuition that different brands have different means and variances in their sensor data, but the learned features (convolution filters) should be universal.
import torch
import torch.nn as nn


class DomainSpecificBatchNorm(nn.Module):
    """Maintain separate BN statistics per domain (brand)."""

    def __init__(self, n_features, n_domains):
        super().__init__()
        self.bn_layers = nn.ModuleList([
            nn.BatchNorm1d(n_features) for _ in range(n_domains)
        ])
        self.n_domains = n_domains

    def forward(self, x, domain_id):
        # Training and inference both route through the selected domain's
        # BN layer; train/eval mode decides whether its statistics update.
        return self.bn_layers[domain_id](x)

    def add_domain(self):
        """Add BN layer for a new brand — initialize from average of existing."""
        ref = self.bn_layers[0]
        # Create the new layer on the same device as the existing ones
        new_bn = nn.BatchNorm1d(ref.num_features).to(ref.weight.device)
        # Initialize with average statistics across existing domains
        with torch.no_grad():
            avg_mean = torch.stack(
                [bn.running_mean for bn in self.bn_layers]
            ).mean(0)
            avg_var = torch.stack(
                [bn.running_var for bn in self.bn_layers]
            ).mean(0)
            new_bn.running_mean.copy_(avg_mean)
            new_bn.running_var.copy_(avg_var)
        self.bn_layers.append(new_bn)
        self.n_domains += 1


class MultiSourceCobotModel(nn.Module):
    """Multi-source model with domain-specific batch normalization."""

    def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
        super().__init__()
        in_ch = n_joints * features_per_joint
        self.conv1 = nn.Conv1d(in_ch, 64, kernel_size=7, padding=3)
        self.bn1 = DomainSpecificBatchNorm(64, n_brands)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=5, padding=2)
        self.bn2 = DomainSpecificBatchNorm(128, n_brands)
        self.conv3 = nn.Conv1d(128, 256, kernel_size=3, padding=1)
        self.bn3 = DomainSpecificBatchNorm(256, n_brands)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )

    def forward(self, x, domain_id=0):
        x = torch.relu(self.bn1(self.conv1(x), domain_id))
        x = torch.relu(self.bn2(self.conv2(x), domain_id))
        x = torch.relu(self.bn3(self.conv3(x), domain_id))
        x = self.pool(x).squeeze(-1)
        return self.classifier(x)
To add a new brand, call model.bn1.add_domain(), model.bn2.add_domain(), and so on for every domain-specific BN layer. Then pass a few hundred unlabelled samples from the new brand through the model to calibrate the new BN statistics. No labelled data is required for initial deployment.
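The calibration pass can be sketched as follows. This is a minimal illustration rather than the article's own procedure: `TinyDomainModel` and `calibrate_new_domain` are hypothetical names, and random tensors stand in for unlabelled sensor windows from the new brand. The only assumption carried over from the text is that the model's forward pass accepts a `domain_id`.

```python
import torch
import torch.nn as nn


class TinyDomainModel(nn.Module):
    """Hypothetical stand-in: one conv layer with one BatchNorm per brand."""

    def __init__(self, in_ch=24, n_domains=2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 8, kernel_size=3, padding=1)
        self.bns = nn.ModuleList([nn.BatchNorm1d(8) for _ in range(n_domains)])

    def forward(self, x, domain_id):
        return torch.relu(self.bns[domain_id](self.conv(x)))


@torch.no_grad()
def calibrate_new_domain(model, batches, domain_id):
    """Update only the BN running statistics of `domain_id` (no labels, no gradients)."""
    model.eval()                  # every other layer stays in inference mode
    bn = model.bns[domain_id]
    bn.reset_running_stats()
    bn.momentum = None            # cumulative average over all calibration batches
    bn.train()                    # train mode is what updates running_mean / running_var
    for x in batches:
        model(x, domain_id)
    model.eval()
    return model


torch.manual_seed(0)
model = TinyDomainModel()
model.bns.append(nn.BatchNorm1d(8))   # slot for the new brand (index 2)
# Shifted, rescaled random windows stand in for the new brand's data.
unlabelled = [3.0 * torch.randn(32, 24, 200) + 5.0 for _ in range(10)]
calibrate_new_domain(model, unlabelled, domain_id=2)
print(model.bns[2].running_mean)
```

Only the new brand's running mean and variance change; the shared convolution weights and the other brands' statistics are left untouched, which is why this step needs no labels.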
Strategy 3: Fine-Tuning with Normalisation Alignment
This is the pragmatic approach. Pre-train a full anomaly detection model on the best-labelled brand (for example, the UR5e with 50,000 labelled samples). When adapting to a new brand, freeze all convolutional and LSTM weights and fine-tune only the batch normalisation layers and the final classifier head.
The reason this approach is effective is that the kinematic structure is the same across brands. The convolutional filters that detect “sudden torque spike in joint 3” or “velocity reversal pattern” are essentially the same regardless of brand. What differs is the statistical distribution of the data, which is precisely what batch normalisation captures.
def bn_only_fine_tune(pretrained_model, target_loader, n_epochs=10, lr=1e-3):
    """Fine-tune only BatchNorm layers + classifier for a new cobot brand.

    This is the fastest adaptation strategy: typically converges in
    5-10 epochs with as few as 100-500 labeled samples.
    """
    model = pretrained_model
    # Freeze everything
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze only BatchNorm parameters and classifier
    for module in model.modules():
        if isinstance(module, nn.BatchNorm1d):
            for param in module.parameters():
                param.requires_grad = True
            # Reset running statistics for the new domain
            module.reset_running_stats()
    for param in model.classifier.parameters():
        param.requires_grad = True
    # Collect trainable params
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()
    print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
    print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch_x, batch_y in target_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            predicted = output.argmax(dim=1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)
        acc = 100.0 * correct / total
        avg_loss = total_loss / len(target_loader)
        print(f"Epoch {epoch+1}/{n_epochs} | Loss: {avg_loss:.4f} | Acc: {acc:.1f}%")
    return model
Strategy 4: Contrastive Domain Adaptation
Contrastive learning offers a strong alternative to adversarial approaches. The core idea is to learn an embedding space in which “normal” operation from any brand maps to similar representations, while “anomalous” patterns remain distinguishable regardless of the brand that produced them.
A Supervised Contrastive (SupCon) loss is used. It pulls together embeddings of the same class (normal or anomaly) regardless of brand, while pushing apart embeddings of different classes:
class SupConDomainLoss(nn.Module):
    """Supervised contrastive loss that ignores domain (brand) labels.

    Positive pairs: same anomaly class, any brand
    Negative pairs: different anomaly class, any brand
    This forces brand-invariant but anomaly-discriminative embeddings.
    """

    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, features, labels):
        """
        Args:
            features: (batch_size, feature_dim) - L2-normalized embeddings
            labels: (batch_size,) - anomaly labels (0=normal, 1=anomaly)
        """
        device = features.device
        batch_size = features.shape[0]
        # Pairwise similarity matrix
        similarity = torch.matmul(features, features.T) / self.temperature
        # Mask: 1 where labels match (positive pairs), 0 otherwise
        labels = labels.unsqueeze(1)
        mask = torch.eq(labels, labels.T).float().to(device)
        # Remove self-similarity from mask
        self_mask = torch.eye(batch_size, device=device)
        mask = mask - self_mask
        # Numerical stability
        logits_max = similarity.max(dim=1, keepdim=True).values.detach()
        logits = similarity - logits_max
        # Denominator: all pairs except self
        exp_logits = torch.exp(logits) * (1 - self_mask)
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)
        # Average over positive pairs
        n_positives = mask.sum(dim=1)
        mean_log_prob = (mask * log_prob).sum(dim=1) / (n_positives + 1e-8)
        # Samples without any positive partner in the batch are skipped
        loss = -mean_log_prob[n_positives > 0].mean()
        return loss

class ContrastiveCobotModel(nn.Module):
    """Contrastive model for cross-brand cobot anomaly detection."""

    def __init__(self, n_input_channels=24, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Projection head for contrastive learning
        self.projector = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # Classifier for anomaly detection
        self.classifier = nn.Linear(256, 2)

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        projections = nn.functional.normalize(self.projector(features), dim=1)
        logits = self.classifier(features)
        return logits, projections
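How the contrastive and classification objectives combine in a single optimisation step is sketched below. The block is self-contained, so it uses a compact functional `supcon_loss` and a deliberately tiny encoder rather than the classes above. The 0.5 loss weight, the layer sizes, and the random mixed-brand batch are illustrative assumptions, not tuned values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def supcon_loss(z, labels, temperature=0.07):
    """Compact supervised contrastive loss on L2-normalised embeddings z."""
    sim = z @ z.T / temperature
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))            # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~eye    # same class, any brand
    n_pos = pos.sum(dim=1)
    keep = n_pos > 0                                     # rows with at least one positive
    mean_log_prob = log_prob.masked_fill(~pos, 0.0).sum(dim=1)[keep] / n_pos[keep]
    return -mean_log_prob.mean()


torch.manual_seed(0)
encoder = nn.Sequential(
    nn.Conv1d(24, 32, kernel_size=7, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)
projector = nn.Linear(32, 16)     # used only for the contrastive term
classifier = nn.Linear(32, 2)     # anomaly logits
params = [*encoder.parameters(), *projector.parameters(), *classifier.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(16, 24, 200)      # random stand-in for a mixed-brand batch
y = torch.tensor([0, 1] * 8)      # 0 = normal, 1 = anomaly

features = encoder(x)
z = F.normalize(projector(features), dim=1)
loss_con = supcon_loss(z, y)
loss_ce = F.cross_entropy(classifier(features), y)
loss = loss_ce + 0.5 * loss_con   # 0.5 is an arbitrary illustrative weight
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"ce={loss_ce.item():.4f} supcon={loss_con.item():.4f}")
```

Brand labels never enter the loss, which is the point of this strategy. In practice each batch should mix brands and contain at least two samples per class, otherwise some rows have no positive pair.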
Strategy 5: Feature Normalisation and Preprocessing
Before turning to neural domain adaptation, consider whether simple preprocessing can eliminate the distribution gap. This straightforward approach is often underused and is sometimes sufficient on its own:
import numpy as np
from scipy.interpolate import interp1d

class CobotSignalNormalizer:
    """Normalize sensor signals to a common reference frame across brands.

    This preprocessing pipeline handles:
    1. Sampling rate alignment (resample to a common window length)
    2. Per-joint Z-score normalization (per brand statistics)
    3. Signal clipping for outlier robustness
    4. Flattening joints and features into channels
    Torque residual computation (removing gravity/friction effects) is not
    implemented here.
    """

    def __init__(self, target_sample_rate=250, target_seq_len=200):
        self.target_sample_rate = target_sample_rate
        self.target_seq_len = target_seq_len
        self.brand_stats = {}  # {brand: {joint: {feature: (mean, std)}}}

    def fit_brand(self, brand_name, data):
        """Compute normalization statistics for a brand.

        Args:
            brand_name: str, e.g. 'ur5e'
            data: np.array of shape (n_samples, n_joints, n_features, seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape
        stats = {}
        for j in range(n_joints):
            stats[j] = {}
            for f in range(n_features):
                channel_data = data[:, j, f, :].flatten()
                stats[j][f] = (
                    float(np.mean(channel_data)),
                    float(np.std(channel_data)) + 1e-8
                )
        self.brand_stats[brand_name] = stats

    def normalize(self, data, brand_name, source_sample_rate):
        """Normalize a batch of sensor data from a specific brand.

        Args:
            data: np.array (n_samples, n_joints, n_features, seq_len)
            brand_name: str
            source_sample_rate: int, Hz
        Returns:
            Normalized data: np.array (n_samples, n_joints*n_features, target_seq_len)
            when resampling is applied; otherwise the input seq_len is kept.
        """
        n_samples, n_joints, n_features, seq_len = data.shape
        # Step 1: Resample to the common window length. This assumes every
        # window spans the same duration regardless of brand.
        if source_sample_rate != self.target_sample_rate:
            source_times = np.linspace(0, 1, seq_len)
            target_times = np.linspace(0, 1, self.target_seq_len)
            # interp1d interpolates along the last (time) axis for every
            # sample, joint, and feature at once
            interpolator = interp1d(source_times, data, kind='cubic', axis=-1)
            data = interpolator(target_times)
        # Step 2: Z-score normalization per joint per feature
        stats = self.brand_stats[brand_name]
        normalized = np.zeros_like(data)
        for j in range(n_joints):
            for f in range(n_features):
                mean, std = stats[j][f]
                normalized[:, j, f, :] = (data[:, j, f, :] - mean) / std
        # Step 3: Clip to ±5 sigma for robustness
        normalized = np.clip(normalized, -5, 5)
        # Step 4: Reshape to (n_samples, channels, seq_len)
        seq_len = normalized.shape[-1]
        output = normalized.reshape(n_samples, n_joints * n_features, seq_len)
        return output
Strategy 6: Foundation Model Approach
The most forward-looking approach draws on the emerging ecosystem of time-series foundation models. The pattern is to pre-train a large model on data from all available cobot brands in a self-supervised manner (for example, masked time-series modelling) and then fine-tune for anomaly detection with minimal labelled data from each brand.
This approach is most appropriate when substantial unlabelled sensor data is available across many brands, which is increasingly common as cobot fleets grow. Models such as Chronos (Amazon), TimesFM (Google), and Lag-Llama have shown that transformer-based architectures can learn transferable representations across diverse time-series domains.
class CobotFoundationModel(nn.Module):
    """Simplified foundation model for cobot sensor time-series.

    Pre-training task: masked sensor reconstruction
    Fine-tuning task: anomaly detection
    """

    def __init__(self, n_channels=24, d_model=256, n_heads=8,
                 n_layers=6, seq_len=200, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Patch embedding (treat each timestep as a "token")
        self.input_proj = nn.Linear(n_channels, d_model)
        self.pos_embedding = nn.Parameter(
            torch.randn(1, seq_len, d_model) * 0.02
        )
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_layers
        )
        # Pre-training head: reconstruct masked timesteps
        self.reconstruction_head = nn.Linear(d_model, n_channels)
        # Fine-tuning head: anomaly classification
        self.anomaly_head = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2),
        )

    def forward_pretrain(self, x):
        """Pre-training: masked reconstruction.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)  # (batch, seq_len, n_channels)
        batch_size, seq_len, _ = x.shape
        # Create random mask
        mask = torch.rand(batch_size, seq_len, device=x.device) < self.mask_ratio
        masked_x = x.clone()
        masked_x[mask] = 0.0
        # Encode
        h = self.input_proj(masked_x) + self.pos_embedding[:, :seq_len, :]
        h = self.transformer(h)
        # Reconstruct
        reconstruction = self.reconstruction_head(h)
        # Loss only on masked positions
        loss = nn.functional.mse_loss(
            reconstruction[mask], x[mask]
        )
        return loss

    def forward_anomaly(self, x):
        """Fine-tuning / inference: anomaly detection.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)
        h = self.input_proj(x) + self.pos_embedding[:, :x.size(1), :]
        h = self.transformer(h)
        # Global average pooling across time
        h_pooled = h.mean(dim=1)
        return self.anomaly_head(h_pooled)
Strategy Comparison and Recommendation
| Strategy | Labeled Data Needed | Complexity | Adaptation Speed | Expected Performance |
|---|---|---|---|---|
| 1. DANN | Source only | Medium-High | Slow (retrain) | High |
| 2. Multi-Source BN | Multiple sources | Medium | Fast (BN calibration only) | High |
| 3. BN Fine-Tuning | 100-500 target samples | Low | Very fast (minutes) | Good |
| 4. Contrastive | Source + some target | Medium-High | Moderate | High |
| 5. Normalization | None (unsupervised stats) | Very Low | Instant | Moderate |
| 6. Foundation Model | Minimal per brand | Very High | Fast (once pre-trained) | Highest (with scale) |
Practical Implementation Guide
Data Collection for Cobots
The quality of domain adaptation depends entirely on the quality of the data. For multi-brand cobot anomaly detection, the following considerations apply:
Sensor selection. At a minimum, collect per-joint torque, position, velocity, and motor current. These four signals per joint provide a comprehensive view of the robot's mechanical state. For a six-axis cobot, this yields 24 sensor channels.
Sampling rate. Different brands sample at different rates (UR5e at 500 Hz, FANUC at 250 Hz, KUKA at 1 kHz). Either resample to a common rate, or use architectures that accept variable-length inputs.
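One way to bring the rates quoted above to a common 250 Hz is polyphase resampling, sketched here with `scipy.signal.resample_poly`. This is an alternative to plain interpolation, not a method the article prescribes. The helper name `resample_to_common_rate`, the 250 Hz target, and the random one-second windows are assumptions for illustration.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_to_common_rate(window, source_hz, target_hz=250):
    """Resample a (channels, seq_len) window from source_hz to target_hz."""
    if source_hz == target_hz:
        return window.copy()
    g = gcd(int(source_hz), int(target_hz))
    up, down = int(target_hz) // g, int(source_hz) // g
    # Polyphase filtering applies an anti-aliasing low-pass before decimation
    return resample_poly(window, up, down, axis=-1)


rates_hz = {"ur5e": 500, "fanuc": 250, "kuka": 1000}   # rates quoted in the text
rng = np.random.default_rng(0)
resampled = {}
for brand, hz in rates_hz.items():
    window = rng.normal(size=(24, hz))                  # one second of 24-channel data
    resampled[brand] = resample_to_common_rate(window, hz)
    print(brand, resampled[brand].shape)
```

Unlike plain interpolation, the polyphase filter suppresses content above the new Nyquist frequency before samples are dropped. That matters when a 1 kHz signal carries vibration energy above 125 Hz.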
Labelling strategy. Labelling anomalies requires domain expertise. A practical approach is to label by operational segment (one pick-and-place cycle) rather than by individual timestep. Use a three-tier scheme—normal, anomalous, and uncertain—and train only on the first two.
Data volume guidelines. For the source brand, aim for at least 10,000 labelled segments (with at least 500 anomalies). For target brands, even 100 to 500 labelled segments enable effective fine-tuning under Strategy 3 or Strategy 4; Strategy 5 needs no labels at all.
Feature Engineering for Multi-Joint Cobots
Raw sensor signals can be augmented with engineered features that capture domain-relevant physics:
- Joint torque residuals. The difference between measured torque and the torque expected from the robot's dynamic model. This removes the "normal" torque component (gravity, inertia, friction) and isolates anomalous forces.
- Energy consumption profiles. Power = torque × velocity per joint. Anomalies often manifest as unexpected energy consumption patterns before they appear in raw signals.
- Vibration spectra. FFT of accelerometer or high-frequency torque data. Bearing degradation, gear wear, and loose bolts each have distinctive frequency signatures.
- Kinematic error metrics. The difference between commanded and actual trajectory. Increasing tracking error often precedes mechanical failure.
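Three of the features listed above can be computed directly from the raw signals, as in the following NumPy sketch. Torque residuals are omitted because they require the robot's dynamic model. The function name `engineered_features`, the array layout `(n_joints, seq_len)`, and the synthetic 40 Hz test signal are assumptions made for illustration.

```python
import numpy as np


def engineered_features(torque, velocity, q_cmd, q_act, sample_rate_hz):
    """All inputs are arrays of shape (n_joints, seq_len) for one segment."""
    power = torque * velocity                          # instantaneous power per joint
    energy = power.sum(axis=1) / sample_rate_hz        # rectangle-rule integral over time
    tracking_rmse = np.sqrt(np.mean((q_cmd - q_act) ** 2, axis=1))
    # One-sided amplitude spectrum of the mean-removed torque signal
    centred = torque - torque.mean(axis=1, keepdims=True)
    spectrum = np.abs(np.fft.rfft(centred, axis=1))
    freqs = np.fft.rfftfreq(torque.shape[1], d=1.0 / sample_rate_hz)
    dominant_hz = freqs[spectrum.argmax(axis=1)]
    return {
        "power": power,
        "energy": energy,
        "tracking_rmse": tracking_rmse,
        "spectrum": spectrum,
        "dominant_hz": dominant_hz,
    }


# Synthetic one-second segment at 500 Hz: constant load plus a 40 Hz ripple
fs, n_joints = 500, 6
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
torque = 2.0 + 0.5 * np.sin(2 * np.pi * 40 * t) * np.ones((n_joints, 1))
velocity = rng.normal(size=(n_joints, fs))
q_cmd = np.cumsum(velocity, axis=1) / fs
q_act = q_cmd + 0.01                                   # constant 0.01 rad tracking offset

feats = engineered_features(torque, velocity, q_cmd, q_act, fs)
print(feats["dominant_hz"], feats["tracking_rmse"])
```

Removing the mean before the FFT keeps the constant load torque from dominating the spectrum, so the reported peak reflects the ripple and not the DC component.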
Model Architecture Choices
| Architecture | Strengths | Weaknesses | Best For |
|---|---|---|---|
| 1D-CNN | Fast, local pattern detection | Limited long-range dependencies | Short anomaly patterns, real-time edge |
| LSTM/GRU | Sequential memory, temporal context | Slow training, vanishing gradients | Long-term degradation patterns |
| LSTM-AutoEncoder | Unsupervised, reconstruction-based | Threshold tuning, slower inference | Minimal labels, novelty detection |
| Transformer | Global attention, parallelizable | Data-hungry, quadratic complexity | Large datasets, complex multi-joint patterns |
| CNN-LSTM Hybrid | Best of both: local + temporal | More hyperparameters | General-purpose (recommended) |
For the cobot scenario, the CNN-LSTM hybrid is typically the best starting point. A complete implementation with domain adaptation support follows:
class CobotCNNLSTMAutoEncoder(nn.Module):
    """CNN-LSTM AutoEncoder with domain adaptation for cobot anomaly detection.

    Architecture:
    - CNN encoder: extracts local temporal features
    - LSTM: captures sequential dependencies
    - CNN decoder: reconstructs input signal
    - Domain discriminator (optional): for DANN-style adaptation
    Anomaly score: reconstruction error (MSE)
    """

    def __init__(self, n_channels=24, hidden_dim=128, lstm_layers=2,
                 n_domains=None):
        super().__init__()
        # --- Encoder ---
        self.conv_encoder = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm_encoder = nn.LSTM(
            input_size=128,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.2,
        )
        # Bottleneck
        self.bottleneck = nn.Linear(hidden_dim * 2, hidden_dim)
        # --- Decoder ---
        self.lstm_decoder = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            dropout=0.2,
        )
        self.conv_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden_dim, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv1d(128, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, n_channels, kernel_size=3, padding=1),
        )
        # Optional domain discriminator
        self.domain_discriminator = None
        if n_domains is not None:
            self.domain_discriminator = nn.Sequential(
                GradientReversalLayer(lambda_val=1.0),
                nn.Linear(hidden_dim, 64),
                nn.ReLU(),
                nn.Linear(64, n_domains),
            )

    def encode(self, x):
        """Encode input to latent representation.

        x: (batch, n_channels, seq_len)
        """
        # CNN encoding
        conv_out = self.conv_encoder(x)  # (batch, 128, seq_len//4)
        # LSTM encoding
        conv_out = conv_out.transpose(1, 2)  # (batch, seq_len//4, 128)
        lstm_out, _ = self.lstm_encoder(conv_out)  # (batch, seq_len//4, 2*hidden_dim)
        # Take last timestep as global representation
        global_repr = lstm_out[:, -1, :]  # (batch, 2*hidden_dim)
        latent = self.bottleneck(global_repr)  # (batch, hidden_dim)
        return latent, conv_out.shape[1]  # return seq_len for decoder

    def decode(self, latent, target_seq_len):
        """Decode latent representation back to signal.

        latent: (batch, hidden_dim)
        """
        # Repeat latent for each timestep
        repeated = latent.unsqueeze(1).repeat(1, target_seq_len, 1)
        # LSTM decoding
        lstm_out, _ = self.lstm_decoder(repeated)  # (batch, seq_len, hidden_dim)
        # CNN decoding
        lstm_out = lstm_out.transpose(1, 2)  # (batch, hidden_dim, seq_len)
        reconstruction = self.conv_decoder(lstm_out)
        return reconstruction

    def forward(self, x):
        latent, seq_len = self.encode(x)
        reconstruction = self.decode(latent, seq_len)
        # Ensure reconstruction matches input size
        if reconstruction.size(2) != x.size(2):
            reconstruction = nn.functional.interpolate(
                reconstruction, size=x.size(2), mode='linear',
                align_corners=False
            )
        domain_pred = None
        if self.domain_discriminator is not None:
            domain_pred = self.domain_discriminator(latent)
        return reconstruction, domain_pred, latent

    def anomaly_score(self, x):
        """Compute per-sample anomaly score (reconstruction error)."""
        reconstruction, _, _ = self.forward(x)
        # MSE per sample
        mse = ((x - reconstruction) ** 2).mean(dim=(1, 2))
        return mse
def train_cobot_autoencoder(model, source_loader, target_loader=None,
                            n_epochs=100, device='cpu'):
    """Train the CNN-LSTM AutoEncoder with optional domain adaptation."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, n_epochs)
    model.to(device)
    for epoch in range(n_epochs):
        model.train()
        total_recon_loss = 0
        total_domain_loss = 0
        target_iter = iter(target_loader) if target_loader else None
        for batch_x, _, _ in source_loader:
            batch_x = batch_x.to(device)
            # forward() already resizes the reconstruction to the input length
            reconstruction, domain_pred, _ = model(batch_x)
            recon_loss = nn.functional.mse_loss(reconstruction, batch_x)
            total_loss = recon_loss
            # Domain adaptation loss (if target data available)
            if target_iter is not None and domain_pred is not None:
                try:
                    target_x, _, _ = next(target_iter)
                except StopIteration:
                    target_iter = iter(target_loader)
                    target_x, _, _ = next(target_iter)
                target_x = target_x.to(device)
                _, target_domain_pred, _ = model(target_x)
                source_domain_labels = torch.zeros(
                    batch_x.size(0), dtype=torch.long, device=device
                )
                target_domain_labels = torch.ones(
                    target_x.size(0), dtype=torch.long, device=device
                )
                domain_loss = (
                    nn.functional.cross_entropy(domain_pred, source_domain_labels)
                    + nn.functional.cross_entropy(target_domain_pred, target_domain_labels)
                )
                # Out-of-place add: `total_loss += ...` would modify recon_loss
                # in place (same tensor) and inflate the logged reconstruction loss
                total_loss = recon_loss + 0.1 * domain_loss
                total_domain_loss += domain_loss.item()
            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            total_recon_loss += recon_loss.item()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            avg_recon = total_recon_loss / len(source_loader)
            msg = f"Epoch {epoch+1}/{n_epochs} | Recon: {avg_recon:.6f}"
            if target_loader:
                avg_domain = total_domain_loss / len(source_loader)
                msg += f" | Domain: {avg_domain:.4f}"
            print(msg)
    return model
Evaluation Metrics
For production cobot anomaly detection, standard accuracy is uninformative. The class imbalance (often 99% normal and 1% anomaly) makes it trivial to obtain high accuracy by predicting "normal" in every case. The following metrics should be used instead:
- AUROC (Area Under the ROC Curve). The primary metric. Measures the model's ability to rank anomalous samples above normal samples regardless of threshold. Aim for above 0.95.
- F1 Score. The harmonic mean of precision and recall at the optimal threshold. Aim for above 0.85.
- Precision@k. If the top-k most anomalous samples are flagged, the fraction that are true anomalies. This is important for maintenance teams that can investigate only a limited number of alerts per shift.
- False Positive Rate (FPR). Perhaps the most important metric in production. Each false positive triggers an unnecessary investigation and erodes trust in the system. Target an FPR below 1% at the operating threshold.
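The four metrics above can be computed with a few lines of NumPy, as in this sketch. The function names and the ten-sample toy data are illustrative assumptions; for real evaluation a library implementation such as scikit-learn's would be the more usual choice. The threshold here is chosen to respect an FPR budget, which is one possible operating point and not the F1-optimal one.

```python
import numpy as np


def auroc(scores, labels):
    """Probability that a random anomaly outscores a random normal sample (ties count 0.5)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scoring samples."""
    top_k = np.argsort(-scores)[:k]
    return float(labels[top_k].mean())


def threshold_for_fpr(scores, labels, max_fpr=0.01):
    """Threshold such that at most max_fpr of normal samples score strictly above it."""
    normal = np.sort(scores[labels == 0])
    n = len(normal)
    allowed = min(int(np.floor(max_fpr * n + 1e-9)), n - 1)   # tolerated false positives
    return float(normal[n - allowed - 1])


def f1_at_threshold(scores, labels, threshold):
    flagged = scores > threshold
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    fn = np.sum(~flagged & (labels == 1))
    return float(2 * tp / (2 * tp + fp + fn)) if tp else 0.0


labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.25, 0.4, 0.7, 0.9, 0.6])

auc = auroc(scores, labels)                               # 15 of 16 pairs ranked correctly
p_at_2 = precision_at_k(scores, labels, k=2)              # top two: one anomaly, one normal
thr = threshold_for_fpr(scores, labels, max_fpr=0.125)    # one false positive allowed in 8
f1 = f1_at_threshold(scores, labels, thr)
print(auc, p_at_2, thr, f1)
```

On this toy data a single high-scoring normal sample (0.7) is enough to halve Precision@2 while AUROC stays above 0.93, which illustrates why the ranking metric alone does not describe the alert quality a maintenance team experiences.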
Deployment Considerations
Edge versus cloud. Cobot anomaly detection often must run at the edge, directly on the robot controller or a nearby industrial PC. This constrains model size and inference latency. A CNN-based model with approximately 500K parameters can run inference in under 5 ms on an NVIDIA Jetson. The full CNN-LSTM AutoEncoder (around 2M parameters) requires roughly 20 ms. Transformer models may require cloud deployment.
Inference latency requirements. For real-time safety-critical detection (such as collision avoidance), sub-10 ms inference is required. For predictive maintenance (detecting degradation patterns), latency of 100 ms to 1 s is acceptable, since trends are analysed over minutes or hours.
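Whether a given model fits a latency budget is best checked by measuring it on the target hardware. The sketch below shows one way to do that in PyTorch. The small CNN, the helper name `measure_latency_ms`, and the warm-up and run counts are assumptions, and timings from a development machine say little about a Jetson or a robot controller.

```python
import time

import torch
import torch.nn as nn


def measure_latency_ms(model, example, n_warmup=10, n_runs=50):
    """Median single-window inference latency in milliseconds on the current device."""
    model.eval()
    timings = []
    with torch.no_grad():
        for _ in range(n_warmup):             # warm-up runs are not timed
            model(example)
        for _ in range(n_runs):
            if example.is_cuda:
                torch.cuda.synchronize()      # wait for queued GPU work before timing
            start = time.perf_counter()
            model(example)
            if example.is_cuda:
                torch.cuda.synchronize()
            timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]


# Hypothetical small CNN; layer sizes are illustrative only
model = nn.Sequential(
    nn.Conv1d(24, 64, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
    nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 2),
)
n_params = sum(p.numel() for p in model.parameters())
example = torch.randn(1, 24, 200)             # one window, batch size 1 as at inference
latency_ms = measure_latency_ms(model, example)
print(f"{n_params:,} parameters, median latency {latency_ms:.2f} ms")
```

The median is reported because occasional scheduling stalls skew the mean. For a safety-critical budget the worst-case or a high percentile would be the more relevant figure.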
Model update strategy. Domain drift occurs: sensors degrade, firmware updates change data characteristics, and new operating conditions emerge. Plan for periodic recalibration of BN statistics (weekly) and full fine-tuning (monthly) to maintain performance. Use monitoring to trigger updates: if the anomaly score distribution shifts significantly on data known to be normal, the model requires recalibration.
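The monitoring trigger described above can be sketched with a two-sample Kolmogorov-Smirnov statistic on anomaly scores from data known to be normal. The helper names, the 0.2 threshold, and the gamma-distributed synthetic scores are assumptions for illustration; a real deployment would calibrate the threshold on its own history.

```python
import numpy as np


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    reference, current = np.sort(reference), np.sort(current)
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(current, grid, side="right") / len(current)
    return float(np.abs(cdf_ref - cdf_cur).max())


def needs_recalibration(reference, current, threshold=0.2):
    """Flag drift when the score distribution on known-normal data has moved."""
    return ks_statistic(reference, current) > threshold


rng = np.random.default_rng(0)
# Synthetic anomaly scores (reconstruction errors) on normal operation
reference = rng.gamma(shape=2.0, scale=0.01, size=2000)           # at deployment time
stable_week = rng.gamma(shape=2.0, scale=0.01, size=2000)         # same distribution
drifted_week = 1.5 * rng.gamma(shape=2.0, scale=0.01, size=2000) + 0.01

print(needs_recalibration(reference, stable_week))
print(needs_recalibration(reference, drifted_week))
```

A distribution-level check like this reacts to gradual sensor drift or a firmware change before the false positive rate at the fixed operating threshold visibly degrades.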
Putting It Together
Transfer learning is not a single technique but a paradigm that encompasses fine-tuning, domain adaptation, feature extraction, and additional related approaches. Understanding this hierarchy is the first step toward applying it effectively. Fine-tuning adapts a pre-trained model to new data through continued training. Domain adaptation bridges distribution gaps between source and target domains, even without target labels.
For heterogeneous cobot fleets, these techniques are not academic luxuries but operational necessities. The alternative is training separate models for every brand, every firmware version, and every operational context. That path produces an unmaintainable accumulation of models, each requiring its own labelled dataset.
The recommended practical pipeline begins simply: normalise sensor data across brands (Strategy 5) and fine-tune only the batch normalisation layers (Strategy 3). This baseline requires minimal labelled data and can be deployed within hours. If performance falls short, particularly on brands with unusual sensor characteristics, escalate to adversarial domain adaptation (Strategy 1 with DANN) or contrastive methods (Strategy 4). For organisations building long-term cobot intelligence platforms, investment in a foundation model (Strategy 6) yields compounding returns as the fleet grows.
The code examples throughout this article are complete and runnable. They are not production-ready: proper data loading, logging, checkpointing, and monitoring must be added. They do, however, provide the architectural foundation for any of the six strategies discussed. The most demanding aspect of cross-brand cobot anomaly detection is not the algorithm but the collection of representative data and the establishment of a labelling protocol that domain experts can follow consistently.
As collaborative robots become as common as industrial PCs on the factory floor, the ability to transfer anomaly detection across brands will distinguish organisations that scale their automation effectively from those that struggle with model maintenance. Transfer learning, fine-tuning, and domain adaptation are the tools that make such scaling possible.
References
- Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
- Ganin, Y., et al. (2016). Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17(59), 1-35.
- Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. ECCV Workshops.
- Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL 2018.
- Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
- Ansari, A. F., et al. (2024). Chronos: Learning the Language of Time Series. arXiv preprint arXiv:2403.07815.
- Long, M., et al. (2015). Learning Transferable Features with Deep Adaptation Networks. ICML 2015.
- Tzeng, E., et al. (2017). Adversarial Discriminative Domain Adaptation. CVPR 2017.
- Khosla, P., et al. (2020). Supervised Contrastive Learning. NeurIPS 2020.
- Li, Y., et al. (2017). Revisiting Batch Normalization For Practical Domain Adaptation. ICLR Workshop 2017.
- Zhao, H., et al. (2018). Adversarial Multiple Source Domain Adaptation. NeurIPS 2018.
- Courty, N., et al. (2017). Optimal Transport for Domain Adaptation. IEEE TPAMI, 39(9), 1853-1865.
- Das, A., et al. (2024). A Decoder-Only Foundation Model for Time-Series Forecasting (TimesFM). arXiv preprint arXiv:2310.10688.
- ISO/TS 15066:2016. Robots and robotic devices—Collaborative robots. International Organization for Standardization.
Disclaimer: This article is provided for informational and educational purposes only. Code examples are provided as-is and should be thoroughly tested and validated before use in production environments, particularly in safety-critical robotics applications. Practitioners should follow their organisation's safety protocols and applicable ISO standards when deploying anomaly detection systems on collaborative robots.