A Universal Robots UR5e and a FANUC CRX-10iA sit on the same production line, performing identical pick-and-place operations. Both have six joints, both lift the same payload, and both generate streams of torque, position, and velocity data every millisecond. Yet when you train an anomaly detection model on the UR5e’s data and deploy it on the FANUC, the model flags nearly everything as anomalous. The sensor noise profiles are different. The control loop frequencies don’t match. The calibration offsets create entirely different data distributions. You have a model that understands what “normal” looks like for one robot, but is completely blind to normalcy on another.
This is not a toy problem. As collaborative robots (cobots) proliferate across manufacturing, logistics, and healthcare, companies increasingly operate heterogeneous fleets — multiple brands, multiple generations, multiple firmware versions. Training a separate anomaly detection model for every brand is expensive, slow, and wasteful. What if the model could transfer its understanding of normal robot behavior across brands?
That is precisely what transfer learning, fine-tuning, and domain adaptation were built to solve. In this guide, we will dissect these three concepts — clarifying exactly how they relate to each other — and then apply them to a real-world scenario: building a cross-brand anomaly detection system for heterogeneous cobots. By the end, you will have not just theoretical understanding but complete, runnable PyTorch code for multiple adaptation strategies.
Before we go further, let’s establish the conceptual hierarchy that will frame this entire discussion:
Transfer Learning (broad paradigm)
├── Fine-Tuning (retrain pre-trained model on new data)
├── Domain Adaptation (bridge distribution gap between domains)
│ ├── Supervised Domain Adaptation
│ ├── Unsupervised Domain Adaptation (UDA)
│ └── Semi-Supervised Domain Adaptation
├── Feature Extraction (freeze pre-trained layers, train new head)
├── Multi-Task Learning (shared representations)
└── Zero-Shot / Few-Shot Transfer
Transfer learning is the big idea: take knowledge learned in one context and apply it in another. Fine-tuning is one way to do it — you take a pre-trained model and continue training it on your target data. Domain adaptation is another way — you specifically address the fact that your source and target data come from different distributions. Feature extraction, multi-task learning, and zero/few-shot transfer are additional strategies under the same umbrella. They are all siblings, not synonyms.
With that map in hand, let’s explore each territory in depth.
Transfer Learning — The Big Picture
Formal Definition
Transfer learning is the paradigm of leveraging knowledge acquired from a source task or domain to improve learning on a target task or domain. Formally, given a source domain DS with a learning task TS, and a target domain DT with a learning task TT, transfer learning aims to improve the learning of the target predictive function fT(·) using knowledge from DS and TS, where DS ≠ DT or TS ≠ TT.
In plain English: you’ve already spent resources learning something useful somewhere. Now you want to reuse that learning instead of starting from zero.
Why Transfer Learning Matters
The motivation is overwhelmingly practical:
- Limited labeled data: Labeling anomalies in cobot sensor data requires domain experts who understand both the robot’s kinematics and the manufacturing process. You might have thousands of labeled samples for one robot brand but almost none for another.
- Expensive annotation: Each labeled anomaly might require a robotics engineer to review hours of sensor logs. At $150/hour, labeling 10,000 samples across five brands could cost more than the robots themselves.
- Faster convergence: A model initialized with transferred knowledge reaches acceptable performance in hours rather than weeks.
- Better generalization: Features learned from large, diverse datasets often capture universal patterns that improve performance even on seemingly unrelated tasks.
Types of Transfer Learning
The taxonomy breaks down based on what differs between source and target:
| Type | Source Labels | Target Labels | Relationship | Example |
|---|---|---|---|---|
| Inductive Transfer | Available | Available | TS ≠ TT | ImageNet classification → medical image segmentation |
| Transductive Transfer | Available | Not available | DS ≠ DT, TS = TT | UR5e anomaly detection → FANUC anomaly detection (no FANUC labels) |
| Unsupervised Transfer | Not available | Not available | DS ≠ DT | Self-supervised pre-training on all cobot data → clustering |
For our cobot scenario, transductive transfer is the most relevant: we have labeled anomaly data from one or a few brands (source domains) and want to perform the same anomaly detection task on new brands (target domains) where labels are scarce or nonexistent.
When Transfer Learning Works — and When It Fails
Transfer learning is not magic. It works when the source and target share some underlying structure. A model trained on ImageNet transfers well to medical imaging because both involve recognizing edges, textures, and shapes. A model trained on English text transfers well to French because both languages share grammatical abstractions.
It fails — sometimes catastrophically — when the source and target are too dissimilar. This is called negative transfer: the transferred knowledge actively hurts performance on the target task. For example, a model trained on satellite imagery might transfer poorly to microscopy images despite both being “images.” The spatial scales, textures, and semantic meanings are fundamentally different.
In our cobot scenario, transfer learning is highly promising because the robots share the same fundamental kinematic structure. A 6-axis articulated arm generates torque profiles that follow similar physical laws regardless of brand. The differences are in sensor calibration, noise characteristics, and control system idiosyncrasies — exactly the kind of distribution shift that domain adaptation was designed to handle.
Historical Context
Transfer learning’s modern era began with the ImageNet revolution. In 2012, AlexNet showed that deep CNNs could learn powerful visual features. By 2014, researchers discovered that these features — especially from early layers — transferred remarkably well to other vision tasks. “ImageNet pre-training” became the default starting point for almost any computer vision project.
NLP followed a similar trajectory. Word2Vec and GloVe provided transferable word embeddings, but the real revolution came with BERT (2018) and GPT (2018-2019), which showed that pre-training on massive text corpora created representations that transferred to virtually any language task. Today, large language models are perhaps the ultimate transfer learning systems — pre-trained on trillions of tokens, then fine-tuned or prompted for specific tasks.
The time-series and industrial AI domains are now experiencing their own transfer learning moment. Models like Chronos, TimesFM, and Lag-Llama are emerging as foundation models for temporal data, and domain adaptation for sensor data is an active area of research with direct industrial applications.
Training From Scratch vs. Transfer Learning
| Factor | From Scratch | Transfer Learning |
|---|---|---|
| Labeled data needed | Large (10k–1M+ samples) | Small (100–1k samples) |
| Training time | Days to weeks | Hours to days |
| Compute cost | High (multi-GPU) | Low to moderate (single GPU) |
| Performance (limited data) | Poor (overfits) | Good to excellent |
| Performance (abundant data) | Excellent (eventually) | Excellent (faster) |
| Domain expertise needed | High (architecture design) | Moderate (strategy selection) |
| Risk of negative transfer | None | Possible if domains too different |
Fine-Tuning — Techniques and Strategies
Fine-tuning is the most widely used transfer learning technique: take a model pre-trained on a source task/domain, and continue training it on your target data. Simple in concept, nuanced in practice.
Full Fine-Tuning vs. Partial Fine-Tuning
Full fine-tuning updates all parameters of the pre-trained model. This gives the model maximum flexibility to adapt but also the highest risk of overfitting — especially when the target dataset is small. If you have 50,000 labeled samples in your target domain, full fine-tuning is usually safe. If you have 500, it’s dangerous.
Partial fine-tuning freezes some layers (typically earlier ones) and only updates the rest. The intuition is that early layers learn generic, transferable features (edge detectors in vision, basic temporal patterns in time-series), while later layers learn task-specific features. By freezing early layers, you preserve the generic knowledge and only adapt the task-specific parts.
Layer-Wise Learning Rate Decay (Discriminative Fine-Tuning)
Rather than the binary freeze/unfreeze decision, discriminative fine-tuning assigns different learning rates to different layers. Earlier layers get smaller learning rates (they should change slowly), while later layers get larger learning rates (they need more adaptation). A common approach is to multiply the learning rate by a decay factor for each layer moving backwards from the output:
# Discriminative learning rates in PyTorch
def get_discriminative_params(model, base_lr=1e-3, decay_factor=0.9):
"""Assign decreasing learning rates to earlier layers."""
params = []
layers = list(model.named_parameters())
n_layers = len(layers)
for i, (name, param) in enumerate(layers):
# Earlier layers get smaller LR
layer_lr = base_lr * (decay_factor ** (n_layers - i - 1))
params.append({
'params': param,
'lr': layer_lr,
'name': name
})
return params
# Usage
param_groups = get_discriminative_params(model, base_lr=1e-3, decay_factor=0.85)
optimizer = torch.optim.AdamW(param_groups)
Gradual Unfreezing
Gradual unfreezing starts by training only the final layer(s), then progressively unfreezes earlier layers as training proceeds. This prevents early layers from being corrupted by the large gradients that occur at the start of fine-tuning when the loss is high. The strategy was popularized by ULMFiT (Universal Language Model Fine-tuning) and works well for both NLP and time-series tasks.
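The schedule itself is easy to sketch. The helper below assumes the model can be split into an ordered list of layer groups; the function name and the three-stage usage are illustrative, not a fixed API:

```python
import torch.nn as nn

def gradual_unfreeze(layer_groups, stage):
    """Unfreeze the last `stage` groups; keep all earlier groups frozen.

    `layer_groups` is an ordered list of nn.Modules, input side first.
    """
    for i, group in enumerate(layer_groups):
        trainable = i >= len(layer_groups) - stage
        for param in group.parameters():
            param.requires_grad = trainable

# Illustrative usage: a three-block model, unfreezing one block per stage
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2))
groups = list(model.children())
gradual_unfreeze(groups, stage=1)  # stage 1: only the final layer trains
# Later in training: gradual_unfreeze(groups, stage=2), then stage=3
```

A typical schedule advances the stage every few epochs, once the loss has stabilized at the current stage.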
The Fine-Tuning Decision Matrix
The right fine-tuning strategy depends on two factors: how much target data you have, and how similar the source and target domains are.
| Scenario | Target Data Size | Domain Similarity | Recommended Strategy |
|---|---|---|---|
| A | Small (<1k) | High | Feature extraction only (freeze all, train classifier head) |
| B | Small (<1k) | Low | Fine-tune final layers with aggressive regularization |
| C | Large (>10k) | High | Full fine-tuning with small learning rate |
| D | Large (>10k) | Low | Full fine-tuning or train from scratch |
For cobots of the same kinematic structure but different brands, we are firmly in the high domain similarity column. If we have limited labeled data for the target brand (common), Scenario A applies — feature extraction or minimal fine-tuning. If we have substantial data, Scenario C applies — gentle full fine-tuning.
Regularization During Fine-Tuning
Fine-tuning on small datasets risks catastrophic forgetting — the model forgets what it learned during pre-training. Several regularization techniques help:
- L2-SP (L2 penalty Starting Point): Instead of penalizing weights toward zero, penalize them toward their pre-trained values. This keeps the model close to the pre-trained solution while allowing adaptation.
- Dropout: Especially effective when added to fine-tuning layers. Typical values: 0.1–0.3 during fine-tuning vs. 0.5 during training from scratch.
- Early stopping: Monitor validation loss on the target domain and stop when it starts increasing. With small target datasets, overfitting can happen in just a few epochs.
- Weight decay: Standard L2 regularization remains effective, typically at 0.01–0.1 during fine-tuning.
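Of these, L2-SP is the least standard, so here is a minimal sketch. The function name and the `alpha` weight are illustrative; the important detail is that the anchor weights must be snapshotted before fine-tuning begins:

```python
import torch
import torch.nn as nn

def l2_sp_penalty(model, anchor, alpha=0.01):
    """L2-SP: penalize squared distance from the pre-trained weights
    (the 'starting point'), rather than from zero as in plain weight decay."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + ((param - anchor[name]) ** 2).sum()
    return alpha * penalty

# Snapshot the pre-trained weights before fine-tuning begins
model = nn.Linear(4, 2)  # stand-in for the pre-trained model
anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
# In the training loop: loss = task_loss + l2_sp_penalty(model, anchor)
```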
Modern Parameter-Efficient Fine-Tuning
Full fine-tuning updates millions or billions of parameters, which is computationally expensive and requires storing a full copy of the model per task. Parameter-efficient fine-tuning (PEFT) methods address this by updating only a small subset of parameters:
- LoRA (Low-Rank Adaptation): Injects low-rank matrices into each layer. Instead of updating a weight matrix W directly, LoRA decomposes the update as ΔW = BA where B and A are low-rank matrices. This reduces trainable parameters by 10,000x while maintaining performance.
- QLoRA: Combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of large models on a single consumer GPU.
- Adapters: Small bottleneck modules inserted between existing layers. Only adapter parameters are trained; the rest stays frozen.
- Prefix Tuning / Prompt Tuning: Prepend learnable vectors to the input or hidden states. Primarily used in NLP but conceptually applicable to any sequence model.
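To make LoRA concrete, here is a minimal sketch of a LoRA-wrapped linear layer. It follows the common ΔW = (α/r)·BA formulation; treat the class name, rank, and scaling as illustrative rather than a reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base layer plus a trainable low-rank
    update, scaled by alpha / r. B starts at zero, so at initialization
    the wrapped layer behaves exactly like the base layer."""
    def __init__(self, base, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # y = W x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(128, 128), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# trainable = 2 * 4 * 128 = 1,024 parameters, vs 16,512 for the full layer
```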
Fine-Tuning Code Example
Here is a complete example of fine-tuning a PyTorch model with layer freezing and discriminative learning rates for a time-series anomaly detection task:
import torch
import torch.nn as nn
class CobotAnomalyModel(nn.Module):
"""1D-CNN feature extractor + classifier for cobot anomaly detection."""
def __init__(self, n_joints=6, n_features_per_joint=4, seq_len=200):
super().__init__()
in_channels = n_joints * n_features_per_joint # 24 input channels
# Feature extractor (transferable layers)
self.features = nn.Sequential(
nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.Conv1d(64, 128, kernel_size=5, padding=2),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.AdaptiveAvgPool1d(1)
)
# Classifier head (task-specific)
self.classifier = nn.Sequential(
nn.Linear(128, 64),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(64, 2) # normal vs anomaly
)
def forward(self, x):
# x shape: (batch, channels, seq_len)
feat = self.features(x).squeeze(-1)
return self.classifier(feat)
def fine_tune_for_new_brand(
pretrained_model,
target_loader,
val_loader,
freeze_features=True,
base_lr=1e-3,
n_epochs=30
):
"""Fine-tune a pre-trained cobot model for a new brand."""
model = pretrained_model
if freeze_features:
# Strategy A: freeze feature extractor, train only classifier
for param in model.features.parameters():
param.requires_grad = False
optimizer = torch.optim.Adam(
model.classifier.parameters(), lr=base_lr
)
else:
# Strategy C: discriminative learning rates
param_groups = [
{'params': model.features.parameters(), 'lr': base_lr * 0.1},
{'params': model.classifier.parameters(), 'lr': base_lr},
]
optimizer = torch.optim.Adam(param_groups)
criterion = nn.CrossEntropyLoss()
best_val_loss = float('inf')
patience_counter = 0
for epoch in range(n_epochs):
model.train()
for batch_x, batch_y in target_loader:
optimizer.zero_grad()
output = model(batch_x)
loss = criterion(output, batch_y)
loss.backward()
optimizer.step()
# Validation and early stopping
model.eval()
val_loss = 0
with torch.no_grad():
for batch_x, batch_y in val_loader:
output = model(batch_x)
val_loss += criterion(output, batch_y).item()
val_loss /= len(val_loader)
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
torch.save(model.state_dict(), 'best_model.pt')
else:
patience_counter += 1
if patience_counter >= 5:
print(f"Early stopping at epoch {epoch}")
break
model.load_state_dict(torch.load('best_model.pt'))
return model
Domain Adaptation — Bridging the Distribution Gap
While fine-tuning assumes you have at least some labeled data in the target domain, domain adaptation tackles a harder problem: what if you have plenty of labeled data in the source domain but no labels at all in the target domain? This is unsupervised domain adaptation (UDA), and it is the most common and challenging scenario in real-world deployments.
Formal Definition
In domain adaptation, the source and target domains share the same task (e.g., anomaly detection) but have different data distributions. Formally: PS(X) ≠ PT(X), but the labeling function is the same. The goal is to learn a model that performs well on the target distribution despite being trained primarily on the source distribution.
Several types of distribution shift can occur:
- Covariate shift: P(X) changes but P(Y|X) stays the same. The input distributions differ but the relationship between inputs and outputs is preserved. This is the most common scenario for cobots — the sensor data distributions differ across brands, but the definition of “anomaly” remains consistent.
- Label shift: P(Y) changes but P(X|Y) stays the same. The prior probability of classes changes. For example, one brand might have a 2% anomaly rate while another has 5%.
- Concept drift: P(Y|X) changes — the same input means different things in different domains. This is rare for same-structure cobots but could occur if different brands define “normal operating range” differently.
Key Unsupervised Domain Adaptation Methods
Discrepancy-Based Methods
These methods explicitly measure and minimize the distance between source and target feature distributions.
Maximum Mean Discrepancy (MMD) measures the distance between two distributions by comparing their mean embeddings in a reproducing kernel Hilbert space (RKHS). If the mean embeddings are identical, the distributions are identical (for characteristic kernels). In practice, you add an MMD penalty to the training loss that encourages the network to produce similar feature distributions for source and target data.
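As a sketch, the biased empirical MMD² with an RBF kernel can be computed directly on two feature batches; the function name and bandwidth `sigma` are illustrative:

```python
import torch

def mmd_rbf(source, target, sigma=1.0):
    """Biased empirical MMD^2 with an RBF kernel between two feature batches.
    Zero when the batches are identical; grows with the distribution gap."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2.0 * sigma ** 2))
    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2.0 * kernel(source, target).mean())

# In the training loop: total_loss = task_loss + mmd_weight * mmd_rbf(f_src, f_tgt)
```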
CORAL (CORrelation ALignment) aligns the second-order statistics (covariance matrices) of source and target features. Deep CORAL integrates this alignment into the network by adding a CORAL loss at one or more hidden layers. The CORAL loss is simply the Frobenius norm of the difference between source and target covariance matrices.
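The CORAL loss itself is only a few lines. This sketch follows the Deep CORAL normalization by 4d²; names are illustrative:

```python
import torch

def coral_loss(source, target):
    """Deep CORAL loss: squared Frobenius distance between the feature
    covariance matrices of the source and target batches, scaled by 1/(4 d^2)."""
    d = source.size(1)
    def cov(x):
        xm = x - x.mean(dim=0, keepdim=True)
        return xm.T @ xm / (x.size(0) - 1)
    diff = cov(source) - cov(target)
    return (diff ** 2).sum() / (4.0 * d * d)
```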
Adversarial-Based Methods
These methods use an adversarial framework to learn domain-invariant features — features that are useful for the task but that a discriminator cannot use to distinguish between source and target domains.
Domain-Adversarial Neural Networks (DANN) are the flagship approach. The architecture has three components: a shared feature extractor, a task classifier (for anomaly detection), and a domain discriminator. The key innovation is the gradient reversal layer (GRL): during backpropagation, gradients from the domain discriminator are reversed before reaching the feature extractor. This means the feature extractor is trained to maximize the domain discriminator’s loss — i.e., to produce features that confuse the discriminator about which domain the data came from.
ADDA (Adversarial Discriminative Domain Adaptation) uses separate feature extractors for source and target, with the target extractor initialized from the source. The adversarial game is played between the target encoder and the discriminator.
CyCADA (Cycle-Consistent Adversarial Domain Adaptation) combines pixel-level adaptation (using CycleGAN-style image translation) with feature-level adaptation. While primarily used for visual tasks, the concept of cycle-consistent adaptation extends to other modalities.
Self-Training and Pseudo-Labeling
Self-training is a conceptually simple but surprisingly effective approach: train on labeled source data, generate predictions (pseudo-labels) on unlabeled target data, and retrain on the combined dataset. The key challenges are noise in pseudo-labels and confirmation bias. Modern approaches use confidence thresholding (only keep high-confidence pseudo-labels) and curriculum learning (start with the most confident predictions and gradually include less confident ones).
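A minimal confidence-thresholding step might look like this; the function name and threshold value are illustrative:

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(model, unlabeled_x, threshold=0.95):
    """Keep only target samples whose max softmax probability clears the
    threshold; returns (inputs, pseudo_labels) for the retraining step."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=1)
        confidence, pseudo = probs.max(dim=1)
    mask = confidence >= threshold
    return unlabeled_x[mask], pseudo[mask]
```

For curriculum-style self-training, start with a high threshold and lower it over successive rounds as the model adapts to the target domain.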
Optimal Transport Methods
Optimal transport provides a mathematically principled way to measure and minimize the distance between distributions using the Wasserstein distance. It finds the minimum “cost” of transforming one distribution into another and can be used to explicitly map source features to target features.
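As a sketch, an entropy-regularized (Sinkhorn) approximation of the transport cost between source and target feature batches can be computed with a few log-domain iterations; `eps` and `n_iters` are illustrative defaults:

```python
import math
import torch

def sinkhorn_distance(source, target, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport cost (log-domain Sinkhorn)
    between two equal-weight feature batches, a differentiable proxy
    for the Wasserstein distance."""
    cost = torch.cdist(source, target)       # pairwise ground cost
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n))  # uniform source weights
    log_nu = torch.full((m,), -math.log(m))  # uniform target weights
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(n_iters):                 # alternating potential updates
        f = eps * (log_mu - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)
    return (plan * cost).sum()
```

Added to the task loss as a penalty on feature batches, this pulls the source and target feature distributions together; production systems typically use a dedicated OT library rather than a hand-rolled loop.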
Advanced Domain Adaptation Scenarios
The standard UDA setup assumes one source and one target domain. Real-world scenarios are often more complex:
- Multi-source domain adaptation: You have labeled data from multiple source domains (e.g., three cobot brands) and want to adapt to a new target brand. Methods like MDAN (Multi-source Domain Adversarial Networks) and M3SDA handle this by learning domain-specific and domain-shared features simultaneously.
- Partial domain adaptation: The target domain has fewer classes than the source. For example, your source model detects 10 types of anomalies, but the target brand only experiences 6 of them. Standard UDA methods can perform poorly because they try to align classes that don’t exist in the target.
- Open-set domain adaptation: The target domain contains classes not seen in the source. This is realistic for cobots — a new brand might exhibit failure modes not present in the training data. Methods must both adapt known classes and detect unknown target-specific anomalies.
Method Comparison
| Method | Mechanism | Best When | Complexity | Performance |
|---|---|---|---|---|
| MMD | Match kernel mean embeddings | Small domain gap, clean data | Low | Good baseline |
| CORAL | Align covariance matrices | Linear shifts between domains | Low | Good for simple shifts |
| DANN | Adversarial domain confusion | Complex nonlinear shifts | Medium | Strong across scenarios |
| Self-Training | Pseudo-label target data | High-confidence predictions available | Low | Variable (depends on pseudo-label quality) |
| Optimal Transport | Wasserstein distance minimization | Strong theoretical guarantees needed | High | Strong but computationally expensive |
DANN Implementation with Gradient Reversal Layer
Here is a complete PyTorch implementation of a Domain-Adversarial Neural Network:
import torch
import torch.nn as nn
from torch.autograd import Function
class GradientReversalFunction(Function):
"""Gradient Reversal Layer (GRL).
Forward pass: identity function.
Backward pass: negate gradients and scale by lambda.
"""
@staticmethod
def forward(ctx, x, lambda_val):
ctx.lambda_val = lambda_val
return x.clone()
@staticmethod
def backward(ctx, grad_output):
return -ctx.lambda_val * grad_output, None
class GradientReversalLayer(nn.Module):
def __init__(self, lambda_val=1.0):
super().__init__()
self.lambda_val = lambda_val
def forward(self, x):
return GradientReversalFunction.apply(x, self.lambda_val)
class DANN(nn.Module):
"""Domain-Adversarial Neural Network for time-series data."""
def __init__(self, n_input_channels=24, n_classes=2, n_domains=2):
super().__init__()
# Shared feature extractor
self.feature_extractor = nn.Sequential(
nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.Conv1d(64, 128, kernel_size=5, padding=2),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Conv1d(128, 256, kernel_size=3, padding=1),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.AdaptiveAvgPool1d(1), # Global average pooling
)
# Task classifier (anomaly detection)
self.task_classifier = nn.Sequential(
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, n_classes),
)
# Domain discriminator
self.domain_discriminator = nn.Sequential(
GradientReversalLayer(lambda_val=1.0),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, n_domains),
)
def forward(self, x):
features = self.feature_extractor(x).squeeze(-1)
task_output = self.task_classifier(features)
domain_output = self.domain_discriminator(features)
return task_output, domain_output
def set_lambda(self, lambda_val):
"""Update GRL lambda (schedule during training)."""
for module in self.domain_discriminator.modules():
if isinstance(module, GradientReversalLayer):
module.lambda_val = lambda_val
def train_dann(model, source_loader, target_loader, n_epochs=50, device='cpu'):
"""Train DANN with progressive lambda scheduling."""
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_criterion = nn.CrossEntropyLoss()
domain_criterion = nn.CrossEntropyLoss()
model.to(device)
for epoch in range(n_epochs):
model.train()
# Progressive lambda: 0 -> 1 over training
p = epoch / n_epochs
lambda_val = 2.0 / (1.0 + torch.exp(torch.tensor(-10.0 * p))) - 1.0
model.set_lambda(lambda_val.item())
# Iterate over both loaders simultaneously
target_iter = iter(target_loader)
for source_x, source_y in source_loader:
try:
target_x, _ = next(target_iter)
except StopIteration:
target_iter = iter(target_loader)
target_x, _ = next(target_iter)
source_x = source_x.to(device)
source_y = source_y.to(device)
target_x = target_x.to(device)
# Source domain: label = 0
source_task_out, source_domain_out = model(source_x)
source_domain_labels = torch.zeros(
source_x.size(0), dtype=torch.long, device=device
)
# Target domain: label = 1 (no task labels!)
_, target_domain_out = model(target_x)
target_domain_labels = torch.ones(
target_x.size(0), dtype=torch.long, device=device
)
# Combined loss
task_loss = task_criterion(source_task_out, source_y)
domain_loss = domain_criterion(source_domain_out, source_domain_labels) \
+ domain_criterion(target_domain_out, target_domain_labels)
total_loss = task_loss + domain_loss
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}/{n_epochs} | "
f"Task Loss: {task_loss.item():.4f} | "
f"Domain Loss: {domain_loss.item():.4f} | "
f"Lambda: {lambda_val.item():.4f}")
The Cobot Anomaly Detection Scenario
Now let’s apply everything we’ve discussed to a concrete, industrially relevant problem. You manage a factory with multiple collaborative robots from different manufacturers — Universal Robots UR5e, FANUC CRX-10iA, ABB GoFa, KUKA LBR iiwa, and Doosan M1013. All are 6-axis or 7-axis articulated arms performing similar tasks. All generate sensor data: joint torques, positions, velocities, and motor currents.
You want one anomaly detection system that works across all brands, or at least a system that can be quickly adapted to a new brand without collecting thousands of labeled anomaly examples.
The challenge: despite sharing the same kinematic structure, each brand has fundamentally different data distributions due to:
- Sensor characteristics: Different torque sensor resolutions, noise floors, and sampling rates (125 Hz vs 500 Hz vs 1 kHz)
- Control systems: Different PID gains, trajectory planning algorithms, and jerk limits
- Calibration: Different zero-point offsets, gear ratio tolerances, and friction models
- Firmware: Different interpolation methods, filtering strategies, and data encoding
Let’s examine six strategies for tackling this, ranging from simple preprocessing to sophisticated neural domain adaptation.
Strategy 1: Domain-Invariant Feature Learning with DANN
This is the most principled approach. Using the DANN architecture from the previous section, we train on labeled data from one brand (say, UR5e — the most common cobot with the most available data) and use unlabeled data from other brands during training. The gradient reversal layer forces the feature extractor to learn representations that capture anomaly-relevant patterns while being invariant to brand-specific sensor characteristics.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
class CobotSensorDataset(Dataset):
"""Dataset for multi-joint cobot sensor data.
Each sample: (n_joints * n_features, seq_len) tensor
Features per joint: torque, position, velocity, current
"""
def __init__(self, data, labels, domain_id):
self.data = torch.FloatTensor(data) # (N, channels, seq_len)
self.labels = torch.LongTensor(labels) # (N,) - 0=normal, 1=anomaly
self.domain_id = domain_id
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
return self.data[idx], self.labels[idx], self.domain_id
class CobotDANN(nn.Module):
"""DANN specifically designed for cobot anomaly detection.
Input: multi-joint sensor data (6 joints x 4 features = 24 channels)
Task: binary anomaly detection
Domain: cobot brand identification (adversarial)
"""
def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
super().__init__()
in_ch = n_joints * features_per_joint
self.encoder = nn.Sequential(
# Block 1: capture local temporal patterns
nn.Conv1d(in_ch, 64, kernel_size=7, padding=3),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.MaxPool1d(2),
# Block 2: capture mid-range dependencies
nn.Conv1d(64, 128, kernel_size=5, padding=2),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.MaxPool1d(2),
# Block 3: high-level features
nn.Conv1d(128, 256, kernel_size=3, padding=1),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.AdaptiveAvgPool1d(1),
)
self.anomaly_head = nn.Sequential(
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, 2),
)
self.domain_head = nn.Sequential(
GradientReversalLayer(lambda_val=1.0),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, n_brands),
)
def forward(self, x):
features = self.encoder(x).squeeze(-1)
anomaly_pred = self.anomaly_head(features)
domain_pred = self.domain_head(features)
return anomaly_pred, domain_pred, features
def predict_anomaly(self, x):
"""Inference: only anomaly prediction needed."""
features = self.encoder(x).squeeze(-1)
return self.anomaly_head(features)
Strategy 2: Multi-Source Domain Adaptation
When you have data from multiple brands, you can leverage all of them simultaneously. The key insight is to use domain-specific batch normalization: each brand gets its own BN layer to handle its unique distribution statistics, while all other weights are shared. This captures the intuition that different brands have different means and variances in their sensor data, but the learned features (convolution filters) should be universal.
class DomainSpecificBatchNorm(nn.Module):
"""Maintain separate BN statistics per domain (brand)."""
def __init__(self, n_features, n_domains):
super().__init__()
self.bn_layers = nn.ModuleList([
nn.BatchNorm1d(n_features) for _ in range(n_domains)
])
self.n_domains = n_domains
    def forward(self, x, domain_id):
        # Route through this domain's BN layer: training mode updates that
        # domain's running statistics, eval mode normalizes with them.
        return self.bn_layers[domain_id](x)
def add_domain(self):
"""Add BN layer for a new brand — initialize from average of existing."""
new_bn = nn.BatchNorm1d(self.bn_layers[0].num_features)
# Initialize with average statistics across existing domains
with torch.no_grad():
avg_mean = torch.stack(
[bn.running_mean for bn in self.bn_layers]
).mean(0)
avg_var = torch.stack(
[bn.running_var for bn in self.bn_layers]
).mean(0)
new_bn.running_mean.copy_(avg_mean)
new_bn.running_var.copy_(avg_var)
self.bn_layers.append(new_bn)
self.n_domains += 1
class MultiSourceCobotModel(nn.Module):
"""Multi-source model with domain-specific batch normalization."""
def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
super().__init__()
in_ch = n_joints * features_per_joint
self.conv1 = nn.Conv1d(in_ch, 64, kernel_size=7, padding=3)
self.bn1 = DomainSpecificBatchNorm(64, n_brands)
self.conv2 = nn.Conv1d(64, 128, kernel_size=5, padding=2)
self.bn2 = DomainSpecificBatchNorm(128, n_brands)
self.conv3 = nn.Conv1d(128, 256, kernel_size=3, padding=1)
self.bn3 = DomainSpecificBatchNorm(256, n_brands)
self.pool = nn.AdaptiveAvgPool1d(1)
self.classifier = nn.Sequential(
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, 2),
)
def forward(self, x, domain_id=0):
x = torch.relu(self.bn1(self.conv1(x), domain_id))
x = torch.relu(self.bn2(self.conv2(x), domain_id))
x = torch.relu(self.bn3(self.conv3(x), domain_id))
x = self.pool(x).squeeze(-1)
return self.classifier(x)
When a new brand arrives, call model.bn1.add_domain(), model.bn2.add_domain(), and model.bn3.add_domain() to extend each normalization layer. Then run a few hundred unlabeled samples from the new brand through the model to calibrate the new BN statistics. No labeled data is required for initial deployment.
Strategy 3: Fine-Tuning with Normalization Alignment
This is the pragmatist’s approach. Pre-train a full anomaly detection model on your best-labeled brand (e.g., UR5e with 50,000 labeled samples). When adapting to a new brand, freeze all convolutional and LSTM weights and only fine-tune the batch normalization layers and the final classifier head.
Why does this work? Because the kinematic structure is the same across brands. The convolutional filters that detect “sudden torque spike in joint 3” or “velocity reversal pattern” are fundamentally the same regardless of brand. What differs is the statistical distribution of the data — exactly what batch normalization captures.
```python
def bn_only_fine_tune(pretrained_model, target_loader, n_epochs=10, lr=1e-3):
    """Fine-tune only BatchNorm layers + classifier for a new cobot brand.

    This is the fastest adaptation strategy: typically converges in
    5-10 epochs with as few as 100-500 labeled samples.
    """
    model = pretrained_model
    # Freeze everything
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze only BatchNorm parameters and classifier
    for module in model.modules():
        if isinstance(module, nn.BatchNorm1d):
            for param in module.parameters():
                param.requires_grad = True
            # Reset running statistics for the new domain
            module.reset_running_stats()
    for param in model.classifier.parameters():
        param.requires_grad = True
    # Collect trainable params
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()
    print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
    print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch_x, batch_y in target_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            predicted = output.argmax(dim=1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)
        acc = 100.0 * correct / total
        avg_loss = total_loss / len(target_loader)
        print(f"Epoch {epoch+1}/{n_epochs} | Loss: {avg_loss:.4f} | Acc: {acc:.1f}%")
    return model
```
Strategy 4: Contrastive Domain Adaptation
Contrastive learning provides a powerful alternative to adversarial approaches. The core idea: learn an embedding space where “normal” operation from any brand maps to similar representations, and “anomalous” patterns remain distinguishable regardless of which brand produced them.
We use a Supervised Contrastive (SupCon) loss that pulls together embeddings of the same class (normal/anomaly) regardless of brand, while pushing apart embeddings of different classes:
```python
class SupConDomainLoss(nn.Module):
    """Supervised contrastive loss that ignores domain (brand) labels.

    Positive pairs: same anomaly class, any brand.
    Negative pairs: different anomaly class, any brand.
    This forces brand-invariant but anomaly-discriminative embeddings.
    """

    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, features, labels):
        """
        Args:
            features: (batch_size, feature_dim) - L2-normalized embeddings
            labels: (batch_size,) - anomaly labels (0=normal, 1=anomaly)
        """
        device = features.device
        batch_size = features.shape[0]
        # Pairwise similarity matrix
        similarity = torch.matmul(features, features.T) / self.temperature
        # Mask: 1 where labels match (positive pairs), 0 otherwise
        labels = labels.unsqueeze(1)
        mask = torch.eq(labels, labels.T).float().to(device)
        # Remove self-similarity from mask
        self_mask = torch.eye(batch_size, device=device)
        mask = mask - self_mask
        # Numerical stability
        logits_max = similarity.max(dim=1, keepdim=True).values.detach()
        logits = similarity - logits_max
        # Denominator: all pairs except self
        exp_logits = torch.exp(logits) * (1 - self_mask)
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)
        # Average over positive pairs
        n_positives = mask.sum(dim=1)
        mean_log_prob = (mask * log_prob).sum(dim=1) / (n_positives + 1e-8)
        loss = -mean_log_prob[n_positives > 0].mean()
        return loss


class ContrastiveCobotModel(nn.Module):
    """Contrastive model for cross-brand cobot anomaly detection."""

    def __init__(self, n_input_channels=24, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Projection head for contrastive learning
        self.projector = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # Classifier for anomaly detection
        self.classifier = nn.Linear(256, 2)

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        projections = nn.functional.normalize(self.projector(features), dim=1)
        logits = self.classifier(features)
        return logits, projections
```
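Putting the two pieces together, a single joint training step might look like the sketch below: cross-entropy on the classifier logits plus the contrastive loss on the projected embeddings. The loss weights (`ce_weight`, `con_weight`) are illustrative assumptions, not values the SupCon paper prescribes:

```python
import torch
import torch.nn as nn

def contrastive_train_step(model, con_loss_fn, optimizer, batch_x, batch_y,
                           ce_weight=1.0, con_weight=0.5):
    """One joint optimization step for a model that returns (logits, projections).

    con_loss_fn is any contrastive criterion taking (embeddings, labels),
    such as the SupConDomainLoss defined above.
    """
    model.train()
    optimizer.zero_grad()
    logits, projections = model(batch_x)
    loss = (ce_weight * nn.functional.cross_entropy(logits, batch_y)
            + con_weight * con_loss_fn(projections, batch_y))
    loss.backward()
    optimizer.step()
    return loss.item()
```

Batches should mix samples from several brands so that the contrastive term actually sees cross-brand positive pairs; a per-brand sampler defeats the purpose.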
Strategy 5: Feature Normalization / Preprocessing Approach
Before reaching for neural domain adaptation, consider whether simple preprocessing can eliminate the distribution gap. This “boring” approach is often underrated and sometimes sufficient:
```python
import numpy as np
from scipy.interpolate import interp1d


class CobotSignalNormalizer:
    """Normalize sensor signals to a common reference frame across brands.

    This preprocessing pipeline handles:
        1. Sampling rate alignment (resample to common rate)
        2. Per-joint Z-score normalization (per-brand statistics)
        3. Torque residual computation (remove gravity/friction effects)
        4. Signal clipping for outlier robustness
    """

    def __init__(self, target_sample_rate=250, target_seq_len=200):
        self.target_sample_rate = target_sample_rate
        self.target_seq_len = target_seq_len
        self.brand_stats = {}  # {brand: {joint: {feature: (mean, std)}}}

    def fit_brand(self, brand_name, data):
        """Compute normalization statistics for a brand.

        Args:
            brand_name: str, e.g. 'ur5e'
            data: np.array of shape (n_samples, n_joints, n_features, seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape
        stats = {}
        for j in range(n_joints):
            stats[j] = {}
            for f in range(n_features):
                channel_data = data[:, j, f, :].flatten()
                stats[j][f] = (
                    float(np.mean(channel_data)),
                    float(np.std(channel_data)) + 1e-8,
                )
        self.brand_stats[brand_name] = stats

    def normalize(self, data, brand_name, source_sample_rate):
        """Normalize a batch of sensor data from a specific brand.

        Args:
            data: np.array (n_samples, n_joints, n_features, seq_len)
            brand_name: str
            source_sample_rate: int, Hz

        Returns:
            Normalized data: np.array (n_samples, n_joints*n_features, target_seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape
        # Step 1: Resample whenever the rate or sequence length differs from
        # the target, so the output length is always target_seq_len
        if (source_sample_rate != self.target_sample_rate
                or seq_len != self.target_seq_len):
            source_times = np.linspace(0, 1, seq_len)
            target_times = np.linspace(0, 1, self.target_seq_len)
            resampled = np.zeros(
                (n_samples, n_joints, n_features, self.target_seq_len)
            )
            for i in range(n_samples):
                for j in range(n_joints):
                    for f in range(n_features):
                        interpolator = interp1d(
                            source_times, data[i, j, f, :], kind='cubic'
                        )
                        resampled[i, j, f, :] = interpolator(target_times)
            data = resampled
        # Step 2: Z-score normalization per joint per feature
        stats = self.brand_stats[brand_name]
        normalized = np.zeros_like(data)
        for j in range(n_joints):
            for f in range(n_features):
                mean, std = stats[j][f]
                normalized[:, j, f, :] = (data[:, j, f, :] - mean) / std
        # Step 3: Clip to ±5 sigma for robustness
        normalized = np.clip(normalized, -5, 5)
        # Step 4: Reshape to (n_samples, channels, seq_len)
        n_samples = normalized.shape[0]
        seq_len = normalized.shape[-1]
        output = normalized.reshape(n_samples, n_joints * n_features, seq_len)
        return output
```
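The time-axis trick in Step 1 is worth isolating, because it carries a hidden assumption: mapping both grids onto [0, 1] only aligns sample rates when every segment spans the same wall-clock duration. A minimal standalone sketch, using linear `np.interp` for brevity where the pipeline above uses scipy's cubic interpolator:

```python
import numpy as np

def resample_segment(signal, target_len):
    """Resample one fixed-duration 1-D segment to target_len samples.

    Both the source and target sample indices are mapped onto [0, 1],
    which is only valid if every segment covers the same time span
    (e.g. one pick-and-place cycle triggered at the same waypoints).
    """
    src_t = np.linspace(0.0, 1.0, len(signal))
    tgt_t = np.linspace(0.0, 1.0, target_len)
    return np.interp(tgt_t, src_t, signal)
```

If segment durations vary (variable cycle times), resample against real timestamps instead of a normalized index, or the model will see time-warped signals.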
Strategy 6: Foundation Model Approach
The most forward-looking approach leverages the emerging ecosystem of time-series foundation models. The idea is to pre-train a large model on data from all available cobot brands in a self-supervised manner (e.g., masked time-series modeling), then fine-tune for anomaly detection with minimal labeled data from each brand.
This approach makes the most sense when you have access to massive amounts of unlabeled sensor data across many brands — which is increasingly common as cobot fleets grow. Models like Chronos (Amazon), TimesFM (Google), and Lag-Llama have shown that transformer-based architectures can learn transferable representations across diverse time-series domains.
```python
class CobotFoundationModel(nn.Module):
    """Simplified foundation model for cobot sensor time-series.

    Pre-training task: masked sensor reconstruction
    Fine-tuning task: anomaly detection
    """

    def __init__(self, n_channels=24, d_model=256, n_heads=8,
                 n_layers=6, seq_len=200, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Patch embedding (treat each timestep as a "token")
        self.input_proj = nn.Linear(n_channels, d_model)
        self.pos_embedding = nn.Parameter(
            torch.randn(1, seq_len, d_model) * 0.02
        )
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_layers
        )
        # Pre-training head: reconstruct masked timesteps
        self.reconstruction_head = nn.Linear(d_model, n_channels)
        # Fine-tuning head: anomaly classification
        self.anomaly_head = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2),
        )

    def forward_pretrain(self, x):
        """Pre-training: masked reconstruction.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)  # (batch, seq_len, n_channels)
        batch_size, seq_len, _ = x.shape
        # Create random mask
        mask = torch.rand(batch_size, seq_len, device=x.device) < self.mask_ratio
        masked_x = x.clone()
        masked_x[mask] = 0.0
        # Encode
        h = self.input_proj(masked_x) + self.pos_embedding[:, :seq_len, :]
        h = self.transformer(h)
        # Reconstruct
        reconstruction = self.reconstruction_head(h)
        # Loss only on masked positions
        loss = nn.functional.mse_loss(
            reconstruction[mask], x[mask]
        )
        return loss

    def forward_anomaly(self, x):
        """Fine-tuning / inference: anomaly detection.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)
        h = self.input_proj(x) + self.pos_embedding[:, :x.size(1), :]
        h = self.transformer(h)
        # Global average pooling across time
        h_pooled = h.mean(dim=1)
        return self.anomaly_head(h_pooled)
```
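A pre-training loop over a heterogeneous fleet might then look like the sketch below. The `loaders_by_brand` mapping and the brand-by-brand schedule are illustrative assumptions, not part of any published recipe; the only requirement is that every brand's unlabeled data flows through the shared masked-reconstruction objective:

```python
import torch

def pretrain_epoch(model, loaders_by_brand, optimizer):
    """One self-supervised pre-training epoch across all brands.

    loaders_by_brand maps brand name -> iterable of raw
    (batch, n_channels, seq_len) tensors. No labels are needed: the
    masked-reconstruction loss is computed inside model.forward_pretrain.
    """
    model.train()
    total, n = 0.0, 0
    for brand, loader in loaders_by_brand.items():
        for batch_x in loader:
            optimizer.zero_grad()
            loss = model.forward_pretrain(batch_x)
            loss.backward()
            optimizer.step()
            total += loss.item()
            n += 1
    return total / max(n, 1)
```

Interleaving brands within an epoch (rather than finishing one brand before starting the next) tends to keep the learned representation from drifting toward whichever brand contributes the most data.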
Strategy Comparison and Recommendation
| Strategy | Labeled Data Needed | Complexity | Adaptation Speed | Expected Performance |
|---|---|---|---|---|
| 1. DANN | Source only | Medium-High | Slow (retrain) | High |
| 2. Multi-Source BN | Multiple sources | Medium | Fast (BN calibration only) | High |
| 3. BN Fine-Tuning | 100-500 target samples | Low | Very fast (minutes) | Good |
| 4. Contrastive | Source + some target | Medium-High | Moderate | High |
| 5. Normalization | None (unsupervised stats) | Very Low | Instant | Moderate |
| 6. Foundation Model | Minimal per brand | Very High | Fast (once pre-trained) | Highest (with scale) |
Practical Implementation Guide
Data Collection for Cobots
The quality of your domain adaptation depends entirely on the quality of your data. For multi-brand cobot anomaly detection, consider the following:
Sensor selection: At minimum, collect per-joint torque, position, velocity, and motor current. These four signals per joint provide a comprehensive view of the robot's mechanical state. For a 6-axis cobot, that's 24 sensor channels.
Sampling rate: Different brands sample at different rates (UR5e at 500 Hz, FANUC at 250 Hz, KUKA at 1 kHz). Either resample to a common rate or use architectures that handle variable-length inputs.
Labeling strategy: Labeling anomalies requires domain expertise. A practical approach is to label by operational segment (one pick-and-place cycle) rather than by individual timestep. Use a three-tier scheme: normal, anomalous, and uncertain. Only train on the first two.
Data volume guidelines: For the source brand, aim for at least 10,000 labeled segments (with at least 500 anomalies). For target brands, even 100-500 labeled segments enable effective fine-tuning if you use Strategy 3 or 5.
Feature Engineering for Multi-Joint Cobots
Raw sensor signals can be enhanced with engineered features that capture domain-relevant physics:
- Joint torque residuals: The difference between measured torque and expected torque from the robot's dynamic model. This removes the "normal" torque component (gravity, inertia, friction) and isolates anomalous forces.
- Energy consumption profiles: Power = torque × velocity per joint. Anomalies often manifest as unexpected energy consumption patterns before they appear in raw signals.
- Vibration spectra: FFT of accelerometer or high-frequency torque data. Bearing degradation, gear wear, and loose bolts each have distinctive frequency signatures.
- Kinematic error metrics: Difference between commanded and actual trajectory. Increasing tracking error often precedes mechanical failure.
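Several of these features are cheap to compute directly from the raw channels. The sketch below shows per-joint power, tracking error, and a torque magnitude spectrum; the array names and sample rate are illustrative, and torque residuals are omitted because they require the robot's dynamic model:

```python
import numpy as np

def engineered_features(torque, velocity, cmd_pos, actual_pos, sample_rate=250):
    """Compute three of the engineered channels above for one segment.

    All inputs are (n_joints, seq_len) arrays. Returns per-joint power,
    tracking error, the one-sided torque magnitude spectrum, and the
    corresponding frequency bins.
    """
    power = torque * velocity               # instantaneous power per joint
    tracking_error = cmd_pos - actual_pos   # commanded vs. actual position
    # One-sided magnitude spectrum of the torque signal per joint
    spectrum = np.abs(np.fft.rfft(torque, axis=-1))
    freqs = np.fft.rfftfreq(torque.shape[-1], d=1.0 / sample_rate)
    return power, tracking_error, spectrum, freqs
```

These arrays can be stacked as extra channels alongside the raw signals before normalization, at the cost of a wider input layer.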
Model Architecture Choices
| Architecture | Strengths | Weaknesses | Best For |
|---|---|---|---|
| 1D-CNN | Fast, local pattern detection | Limited long-range dependencies | Short anomaly patterns, real-time edge |
| LSTM/GRU | Sequential memory, temporal context | Slow training, vanishing gradients | Long-term degradation patterns |
| LSTM-AutoEncoder | Unsupervised, reconstruction-based | Threshold tuning, slower inference | Minimal labels, novelty detection |
| Transformer | Global attention, parallelizable | Data-hungry, quadratic complexity | Large datasets, complex multi-joint patterns |
| CNN-LSTM Hybrid | Best of both: local + temporal | More hyperparameters | General-purpose (recommended) |
For the cobot scenario, the CNN-LSTM hybrid is typically the best starting point. Here's a complete implementation with domain adaptation support:
```python
class CobotCNNLSTMAutoEncoder(nn.Module):
    """CNN-LSTM AutoEncoder with domain adaptation for cobot anomaly detection.

    Architecture:
        - CNN encoder: extracts local temporal features
        - LSTM: captures sequential dependencies
        - CNN decoder: reconstructs input signal
        - Domain discriminator (optional): for DANN-style adaptation

    Anomaly score: reconstruction error (MSE)
    """

    def __init__(self, n_channels=24, hidden_dim=128, lstm_layers=2,
                 n_domains=None):
        super().__init__()
        # --- Encoder ---
        self.conv_encoder = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm_encoder = nn.LSTM(
            input_size=128,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.2,
        )
        # Bottleneck
        self.bottleneck = nn.Linear(hidden_dim * 2, hidden_dim)
        # --- Decoder ---
        self.lstm_decoder = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            dropout=0.2,
        )
        self.conv_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden_dim, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv1d(128, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, n_channels, kernel_size=3, padding=1),
        )
        # Optional domain discriminator
        self.domain_discriminator = None
        if n_domains is not None:
            self.domain_discriminator = nn.Sequential(
                GradientReversalLayer(lambda_val=1.0),
                nn.Linear(hidden_dim, 64),
                nn.ReLU(),
                nn.Linear(64, n_domains),
            )

    def encode(self, x):
        """Encode input to latent representation.

        x: (batch, n_channels, seq_len)
        """
        # CNN encoding
        conv_out = self.conv_encoder(x)  # (batch, 128, seq_len//4)
        # LSTM encoding
        conv_out = conv_out.transpose(1, 2)  # (batch, seq_len//4, 128)
        lstm_out, _ = self.lstm_encoder(conv_out)  # (batch, seq_len//4, 256)
        # Take last timestep as global representation
        global_repr = lstm_out[:, -1, :]  # (batch, 256)
        latent = self.bottleneck(global_repr)  # (batch, hidden_dim)
        return latent, conv_out.shape[1]  # return seq_len for decoder

    def decode(self, latent, target_seq_len):
        """Decode latent representation back to signal.

        latent: (batch, hidden_dim)
        """
        # Repeat latent for each timestep
        repeated = latent.unsqueeze(1).repeat(1, target_seq_len, 1)
        # LSTM decoding
        lstm_out, _ = self.lstm_decoder(repeated)  # (batch, seq_len, hidden_dim)
        # CNN decoding
        lstm_out = lstm_out.transpose(1, 2)  # (batch, hidden_dim, seq_len)
        reconstruction = self.conv_decoder(lstm_out)
        return reconstruction

    def forward(self, x):
        latent, seq_len = self.encode(x)
        reconstruction = self.decode(latent, seq_len)
        # Ensure reconstruction matches input size
        if reconstruction.size(2) != x.size(2):
            reconstruction = nn.functional.interpolate(
                reconstruction, size=x.size(2), mode='linear',
                align_corners=False
            )
        domain_pred = None
        if self.domain_discriminator is not None:
            domain_pred = self.domain_discriminator(latent)
        return reconstruction, domain_pred, latent

    def anomaly_score(self, x):
        """Compute per-sample anomaly score (reconstruction error)."""
        reconstruction, _, _ = self.forward(x)
        # MSE per sample
        mse = ((x - reconstruction) ** 2).mean(dim=(1, 2))
        return mse
```
```python
def train_cobot_autoencoder(model, source_loader, target_loader=None,
                            n_epochs=100, device='cpu'):
    """Train the CNN-LSTM AutoEncoder with optional domain adaptation."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, n_epochs)
    model.to(device)
    for epoch in range(n_epochs):
        model.train()
        total_recon_loss = 0
        total_domain_loss = 0
        target_iter = iter(target_loader) if target_loader else None
        for batch_x, _, _ in source_loader:
            batch_x = batch_x.to(device)
            reconstruction, domain_pred, _ = model(batch_x)
            # Match sizes if needed
            if reconstruction.size(2) != batch_x.size(2):
                reconstruction = nn.functional.interpolate(
                    reconstruction, size=batch_x.size(2),
                    mode='linear', align_corners=False
                )
            recon_loss = nn.functional.mse_loss(reconstruction, batch_x)
            total_loss = recon_loss
            # Domain adaptation loss (if target data available)
            if target_iter is not None and domain_pred is not None:
                try:
                    target_x, _, _ = next(target_iter)
                except StopIteration:
                    target_iter = iter(target_loader)
                    target_x, _, _ = next(target_iter)
                target_x = target_x.to(device)
                _, target_domain_pred, _ = model(target_x)
                source_domain_labels = torch.zeros(
                    batch_x.size(0), dtype=torch.long, device=device
                )
                target_domain_labels = torch.ones(
                    target_x.size(0), dtype=torch.long, device=device
                )
                domain_loss = (
                    nn.functional.cross_entropy(domain_pred, source_domain_labels)
                    + nn.functional.cross_entropy(target_domain_pred, target_domain_labels)
                )
                total_loss += 0.1 * domain_loss
                total_domain_loss += domain_loss.item()
            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            total_recon_loss += recon_loss.item()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            avg_recon = total_recon_loss / len(source_loader)
            msg = f"Epoch {epoch+1}/{n_epochs} | Recon: {avg_recon:.6f}"
            if target_loader:
                avg_domain = total_domain_loss / len(source_loader)
                msg += f" | Domain: {avg_domain:.4f}"
            print(msg)
    return model
```
Evaluation Metrics
For production cobot anomaly detection, standard accuracy is meaningless — with the typical class imbalance (often 99% normal, 1% anomaly), a model that always predicts "normal" scores 99% accuracy while catching nothing. Use these metrics instead:
- AUROC (Area Under ROC Curve): The primary metric. Measures the model's ability to rank anomalous samples higher than normal samples regardless of threshold. Aim for > 0.95.
- F1 Score: The harmonic mean of precision and recall at the optimal threshold. Aim for > 0.85.
- Precision@k: If you flag the top-k most anomalous samples, what fraction are true anomalies? Critical for maintenance teams who can only investigate a limited number of alerts per shift.
- False Positive Rate (FPR): Perhaps the most critical metric in production. Each false positive triggers an unnecessary investigation, reducing trust in the system. Target FPR < 1% at your operating threshold.
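AUROC and Precision@k reduce to simple rank statistics and need no external dependencies. The sketch below uses the Mann-Whitney U formulation of AUROC and assumes continuous, tie-free scores:

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: the probability that a randomly chosen anomaly
    scores higher than a randomly chosen normal sample.

    scores: continuous anomaly scores; labels: 0 = normal, 1 = anomaly.
    Ties are not averaged, so this assumes continuous scores.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scoring samples."""
    top_k = np.asarray(scores).argsort()[::-1][:k]
    return float(np.asarray(labels)[top_k].mean())
```

In practice you would pull these from `sklearn.metrics`, but seeing the rank formulation makes clear why AUROC is threshold-free while Precision@k directly models a fixed alert budget per shift.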
Deployment Considerations
Edge vs. Cloud: Cobot anomaly detection often needs to run at the edge — directly on the robot controller or a nearby industrial PC. This constrains model size and inference latency. A CNN-based model with ~500K parameters can run inference in under 5ms on an NVIDIA Jetson. The full CNN-LSTM AutoEncoder (~2M parameters) needs about 20ms. Transformer models may require cloud deployment.
Inference latency requirements: For real-time safety-critical detection (e.g., collision avoidance), you need sub-10ms inference. For predictive maintenance (detecting degradation patterns), latency of 100ms–1s is acceptable since you're analyzing trends over minutes or hours.
Model update strategy: Domain drift happens — sensors degrade, firmware updates change data characteristics, and new operating conditions emerge. Plan for periodic re-calibration of BN statistics (weekly) and full fine-tuning (monthly) to maintain performance. Use monitoring to trigger updates: if anomaly score distributions shift significantly on data you know is normal, the model needs recalibration.
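The monitoring trigger can be as simple as a mean-shift test on scores from data you know is normal. The sketch below is one deliberately minimal choice (a z-test on the mean against a commissioning baseline); a two-sample KS test on the full score distribution is a common stronger alternative:

```python
import numpy as np

def needs_recalibration(baseline_scores, recent_scores, z_threshold=3.0):
    """Flag drift when the mean anomaly score on known-normal data moves
    more than z_threshold standard errors from the commissioning baseline.

    baseline_scores: scores collected at deployment on verified-normal data.
    recent_scores: scores from the current monitoring window, also on
    data believed to be normal.
    """
    baseline = np.asarray(baseline_scores, dtype=float)
    recent = np.asarray(recent_scores, dtype=float)
    std_err = baseline.std(ddof=1) / np.sqrt(len(recent))
    z = abs(recent.mean() - baseline.mean()) / (std_err + 1e-12)
    return bool(z > z_threshold)
```

When this fires, start with the cheap fix (re-calibrating BN statistics on fresh normal data) before scheduling a full fine-tune.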
Conclusion
Transfer learning is not a single technique — it is a paradigm that encompasses fine-tuning, domain adaptation, feature extraction, and more. Understanding this hierarchy is the first step toward applying it effectively. Fine-tuning adapts a pre-trained model to new data through continued training. Domain adaptation bridges distribution gaps between source and target domains, even without target labels.
For heterogeneous cobot fleets, these techniques are not academic luxuries — they are operational necessities. The alternative is training separate models for every brand, every firmware version, and every operational context. That path leads to an unmaintainable jungle of models, each demanding its own labeled dataset.
The practical pipeline we recommend starts simple: normalize your sensor data across brands (Strategy 5) and fine-tune only the batch normalization layers (Strategy 3). This baseline requires minimal labeled data and can be deployed in hours. If performance falls short — particularly on brands with unusual sensor characteristics — escalate to adversarial domain adaptation (Strategy 1 with DANN) or contrastive methods (Strategy 4). For organizations building long-term cobot intelligence platforms, investing in a foundation model (Strategy 6) will yield compounding returns as the fleet grows.
The code examples throughout this post are complete and runnable. They are not production-ready — you'll need to add proper data loading, logging, checkpointing, and monitoring — but they provide the architectural foundation for any of the six strategies we discussed. The hardest part of cross-brand cobot anomaly detection is not the algorithm; it is collecting representative data and establishing a labeling protocol that domain experts can follow consistently.
As collaborative robots become as common as industrial PCs on the factory floor, the ability to transfer anomaly detection intelligence across brands will separate the organizations that scale their automation from those that drown in model maintenance. Transfer learning, fine-tuning, and domain adaptation are the tools that make that scaling possible.
References
- Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
- Ganin, Y., et al. (2016). Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17(59), 1-35.
- Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. ECCV Workshops.
- Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL 2018.
- Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
- Ansari, A. F., et al. (2024). Chronos: Learning the Language of Time Series. arXiv preprint arXiv:2403.07815.
- Long, M., et al. (2015). Learning Transferable Features with Deep Adaptation Networks. ICML 2015.
- Tzeng, E., et al. (2017). Adversarial Discriminative Domain Adaptation. CVPR 2017.
- Khosla, P., et al. (2020). Supervised Contrastive Learning. NeurIPS 2020.
- Li, Y., et al. (2017). Revisiting Batch Normalization For Practical Domain Adaptation. ICLR Workshop 2017.
- Zhao, H., et al. (2018). Adversarial Multiple Source Domain Adaptation. NeurIPS 2018.
- Courty, N., et al. (2017). Optimal Transport for Domain Adaptation. IEEE TPAMI, 39(9), 1853-1865.
- Das, A., et al. (2024). A Decoder-Only Foundation Model for Time-Series Forecasting. ICML 2024; arXiv preprint arXiv:2310.10688 (TimesFM).
- ISO/TS 15066:2016. Robots and robotic devices — Collaborative robots. International Organization for Standardization.
Disclaimer: This article is for informational and educational purposes only. Any code examples are provided as-is and should be thoroughly tested and validated before use in production environments, especially in safety-critical robotics applications. Always follow your organization's safety protocols and applicable ISO standards when deploying anomaly detection systems on collaborative robots.