SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow Removal¶

Conference: CVPR 2025
arXiv: 2409.07041
Code: Yes
Area: Image Shadow Removal / Image Restoration
Keywords: Soft Shadow Mask, Penumbra-Aware, SAM Fine-Tuning, Physics-Constrained Loss, End-to-End Shadow Removal

TL;DR¶

This work proposes SoftShadow, a framework that replaces traditional binary hard masks with continuous grayscale soft masks to represent shadow regions. It predicts soft masks with a LoRA-fine-tuned SAM and introduces a penumbra formation constraint loss, jointly training the detection and shadow removal networks. It reports state-of-the-art performance on four datasets (SRD, ISTD+, LRSS, UIUC) without requiring external mask inputs.

Background & Motivation¶

Background: Deep learning has made significant progress in shadow removal. Existing methods are divided into two categories: those relying on pre-generated shadow masks (e.g., BMNet, HomoFormer using binary masks from DHAN/FDRNet detectors) and those directly removing shadows in an end-to-end manner (e.g., DC-ShadowNet, DeS3). However, the latter often underperforms the former due to the lack of explicit shadow location information.

Limitations of Prior Work: All mask-dependent methods use binary masks (\(s \in \{0, 1\}\)) to annotate shadow regions. However, the penumbra of a real shadow exhibits a gradual brightness transition that a binary representation cannot encode, which leads to severe artifacts near shadow boundaries in the removal results. Furthermore, mask quality varies drastically across shadow detectors, causing PSNR fluctuations of more than 7 dB and making such systems far from robust.

Key Challenge: The physical formation model of shadows naturally contains both umbra (full occlusion) and penumbra (partial occlusion) regions. The illumination in the penumbra region is continuously graded, but existing mask representations discard this critical information. There is a need for a mask representation that can both precisely locate shadow boundaries and encode penumbra gradient details.

Goal: Design a continuous soft mask (\(s \in [0, 1]\)) to replace binary hard masks, and construct an end-to-end framework to automatically predict soft masks, guiding shadow removal while eliminating dependency on external mask detectors.

Key Insight: Starting from the physical model of shadow formation, a shadow image can be modeled as \(\mathbf{y} = \mathbf{a} \cdot \mathbf{s} \cdot \mathbf{x} + (1 - \mathbf{s}) \cdot \mathbf{x}\), where \(\mathbf{x}\) is the shadow-free image, \(\mathbf{s}\) is the continuous soft mask, and \(\mathbf{a}\) scales the illumination inside the shadow (setting \(\mathbf{s} = 1\) gives \(\mathbf{y} = \mathbf{a} \cdot \mathbf{x}\)). To exploit the strong segmentation priors of SAM, the method fine-tunes it with LoRA so that it outputs continuous soft masks instead of binary segmentation results, and a physics-inspired penumbra constraint regularizes the mask gradients.
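
The formation model above can be checked numerically. The following is a minimal sketch, assuming a scalar \(\mathbf{a}\) and a 1D strip of pixels for readability; the function names `apply_shadow` and `remove_shadow` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def apply_shadow(x, s, a):
    """Compose a shadow image: y = a*s*x + (1 - s)*x."""
    x = np.asarray(x, dtype=np.float64)
    s = np.asarray(s, dtype=np.float64)
    return a * s * x + (1.0 - s) * x

def remove_shadow(y, s, a):
    """Invert the model algebraically: x = y / (1 - s*(1 - a)), valid for a > 0."""
    y = np.asarray(y, dtype=np.float64)
    s = np.asarray(s, dtype=np.float64)
    return y / (1.0 - s * (1.0 - a))

# A strip crossing a shadow edge: lit (s=0), penumbra (0<s<1), umbra (s=1).
x = np.full(5, 0.8)                        # shadow-free intensity
s = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # soft mask
a = 0.4                                    # assumed attenuation value
y = apply_shadow(x, s, a)                  # brightness falls off linearly with s
x_rec = remove_shadow(y, s, a)             # recovers x when s and a are known
```

A binary mask would have to round the three penumbra values of `s` to 0 or 1, so the same inversion would over- or under-correct exactly those pixels.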

Core Idea: Upgrade shadow masks from binary discrete representations to continuous soft representations, and guide SAM to learn gradient transitions in the penumbra region through physical constraints, achieving more natural boundary restoration.

Method¶

Overall Architecture¶

SoftShadow is an end-to-end unified shadow removal framework that does not require any external mask input. The overall pipeline is: input shadow image \(\mathbf{y}\) \(\rightarrow\) SAM+LoRA encoder extracts features and predicts a continuous soft mask \(\hat{\mathbf{s}}\) \(\rightarrow\) feed the soft mask together with the shadow image into the shadow removal backbone network (ShadowDiffusion) \(\rightarrow\) output the shadow-removed result \(\hat{\mathbf{x}}\). During training, three losses are optimized simultaneously: mask reconstruction loss \(\mathcal{L}_{mask}\), penumbra constraint loss \(\mathcal{L}_{pen}\), and shadow removal loss \(\mathcal{L}_{rem}\), jointly fine-tuning SAM and the shadow removal network.

Key Designs¶

SAM+LoRA Soft Mask Detector:
- Function: Transforms the pre-trained SAM from a binary segmenter into a continuous soft mask predictor that outputs \(\hat{\mathbf{s}} \in [0,1]\), where \(s=0\) represents illuminated regions, \(s=1\) represents full shadow regions (umbra), and \(s \in (0,1)\) represents penumbra regions.
- Mechanism: SAM uses ViT-H as its image encoder. LoRA adaptation layers (rank 8) are inserted into all self-attention layers, while the mask decoder is fine-tuned with all parameters. Since SAM was originally trained for binary segmentation, two loss functions, \(\mathcal{L}_{mask}\) and \(\mathcal{L}_{pen}\), guide it to output continuous values. The ground-truth (GT) mask is obtained by converting both the shadow and shadow-free images to the YCbCr space and calculating the Y-channel ratio: \(\mathbf{s}_{gt} = \max(t, f(\mathbf{x}_Y / \mathbf{y}_Y))\), where \(f\) is a low-pass filter and \(t=0.76\) is a threshold.
- Design Motivation: SAM possesses powerful segmentation priors, but directly using SAM to predict shadow masks yields poor performance (PSNR of only 28.16 dB). Fine-tuning with LoRA can significantly reduce trainable parameters while effectively adapting SAM to the continuous mask prediction task, balancing efficiency and effectiveness.
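
The LoRA adaptation described above can be illustrated with a small NumPy sketch. It assumes the standard LoRA formulation (a frozen weight plus a scaled low-rank update `B @ A`); the class name `LoRALinear`, the scaling parameter `alpha`, the initialization, and the 64-dimensional layer are illustrative assumptions, and only the rank of 8 comes from the text.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=8, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank                          # assumed scaling convention

    def __call__(self, x):
        # Output = frozen path + scaled low-rank path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))   # stand-in for one attention projection matrix
layer = LoRALinear(W, rank=8)
x = rng.normal(size=(4, 64))
out = layer(x)                  # equals x @ W.T at initialization because B is zero
```

With these toy sizes the adapter trains 1,024 parameters against 4,096 in the frozen matrix, and the gap widens as the layer dimension grows, which is the efficiency argument made in the design motivation.
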
Penumbra Formation Constraint Loss:
- Function: Regularizes the gradient direction and magnitude of the predicted soft mask in the penumbra region to ensure the mask reflects true illumination gradient characteristics.
- Mechanism: The loss rests on two physical assumptions: (1) the mask intensity in the penumbra region should gradually decrease from the shadow center outwards, so the mask gradient should not have a positive component along the outward direction; (2) the transition should be smooth, meaning the gradient magnitude should not be excessively large. First, the penumbra region is defined via thresholds \(t_1, t_2\) as \(\mathbf{w} = \{(i,j) \mid t_1 \leq \mathbf{s}_{i,j} \leq t_2\}\). The shadow center \(\mathbf{c}\) is computed as the mean coordinate of the penumbra region. The outward unit direction is defined as \(\mathbf{d}(\mathbf{w}) = (\mathbf{w} - \mathbf{c}) / \|\mathbf{w} - \mathbf{c}\|\). The constraint loss is formulated as \(\mathcal{L}_{pen} = \mathbb{E}_{n,\mathbf{w}}[\text{ReLU}(\mathbf{d}(\mathbf{w}) \cdot \nabla \hat{\mathbf{s}}(\mathbf{w}))]\), where \(\nabla \hat{\mathbf{s}}\) is the spatial gradient of the predicted soft mask. The ReLU zeroes out the terms that already agree with the expected behavior (negative dot products) and penalizes only the gradient components that conflict with it.
- Design Motivation: Pure mask reconstruction loss struggles to accurately constrain the local gradient characteristics of the penumbra region, which is precisely the area most prone to artifacts in shadow removal. This constraint translates the physical intuition of shadow formation (illumination gradually transitioning from the center outwards) into a differentiable regularization term, explicitly guiding the model to learn soft masks with smooth transitions.
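
The penumbra constraint can be sketched for a single mask as follows. This is an illustrative NumPy version, not the authors' implementation: the threshold values `t1=0.1` and `t2=0.9`, the finite-difference gradient from `np.gradient`, and the function name `penumbra_loss` are all assumptions, since the text does not specify them.

```python
import numpy as np

def penumbra_loss(mask, t1=0.1, t2=0.9, eps=1e-8):
    """Mean of ReLU(d(w) . grad(mask)(w)) over penumbra pixels w of one mask."""
    mask = np.asarray(mask, dtype=np.float64)
    rows, cols = np.nonzero((mask >= t1) & (mask <= t2))   # penumbra set w
    if rows.size == 0:
        return 0.0
    coords = np.stack([rows, cols], axis=1).astype(np.float64)
    center = coords.mean(axis=0)                            # shadow center c
    offsets = coords - center
    norms = np.linalg.norm(offsets, axis=1, keepdims=True)
    directions = offsets / (norms + eps)                    # outward unit vectors d(w)
    grad_r, grad_c = np.gradient(mask)                      # d/drow, d/dcol
    grads = np.stack([grad_r[rows, cols], grad_c[rows, cols]], axis=1)
    projection = np.sum(directions * grads, axis=1)
    return float(np.mean(np.maximum(projection, 0.0)))      # ReLU, then average

# A cone-shaped mask that fades outwards satisfies the constraint;
# its inverse (brightening outwards) violates it everywhere in the penumbra.
yy, xx = np.mgrid[0:65, 0:65]
r = np.hypot(yy - 32, xx - 32)
good = np.clip(1.0 - r / 20.0, 0.0, 1.0)
bad = 1.0 - good
loss_good = penumbra_loss(good)
loss_bad = penumbra_loss(bad)
```

Because the center is a single mean coordinate, the sketch inherits the single-center assumption: two separate shadows in one image would share one center and produce misleading directions.
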
Joint End-to-End Training Strategy:
- Function: Simultaneously optimizes the SAM soft mask detector and the shadow removal backbone network, allowing them to mutually reinforce each other.
- Mechanism: The total loss is defined as \(\mathcal{L} = \mathcal{L}_{mask} + \lambda_1 \mathcal{L}_{pen} + \lambda_2 \mathcal{L}_{rem}\), where \(\lambda_1 = 0.1\) and \(\lambda_2 = 1\). The mask reconstruction loss is the squared Frobenius norm of the difference between the predicted and GT masks: \(\mathcal{L}_{mask} = \mathbb{E}_n \|\hat{\mathbf{s}} - \mathbf{s}_{gt}\|_F^2\). The shadow removal loss takes the same form between the restored image and its GT, i.e., an unnormalized MSE: \(\mathcal{L}_{rem} = \mathbb{E}_n \|\hat{\mathbf{x}} - \mathbf{x}\|_F^2\). The gradients of the shadow removal loss are backpropagated to SAM, steering the predicted soft mask in the direction that most benefits shadow removal.
- Design Motivation: If the detector and shadow removal network are trained separately, the mask optimization objective might conflict with the shadow removal objective. Joint training feeds the shadow removal performance directly back to mask prediction, forming a virtuous cycle: better masks produce better shadow removal results, which in turn improve mask quality.
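
The weighted sum above can be written out for a single sample as a short sketch. It follows the formulas literally with squared Frobenius norms; the function name `total_loss` is hypothetical, the penumbra term is passed in as a precomputed number, and the batch expectation \(\mathbb{E}_n\) would be a mean over samples.

```python
import numpy as np

def total_loss(s_hat, s_gt, x_hat, x, l_pen, lambda1=0.1, lambda2=1.0):
    """L = L_mask + lambda1 * L_pen + lambda2 * L_rem for one sample."""
    l_mask = float(np.sum((s_hat - s_gt) ** 2))   # ||s_hat - s_gt||_F^2
    l_rem = float(np.sum((x_hat - x) ** 2))       # ||x_hat - x||_F^2
    return l_mask + lambda1 * l_pen + lambda2 * l_rem

# Toy values chosen so each term is easy to verify by hand.
s_hat = np.zeros((2, 2))
s_gt = np.ones((2, 2))             # L_mask = 4 * 1^2 = 4
x_hat = np.full((2, 2, 3), 0.5)
x = np.zeros((2, 2, 3))            # L_rem = 12 * 0.5^2 = 3
loss = total_loss(s_hat, s_gt, x_hat, x, l_pen=2.0)   # 4 + 0.1*2 + 1*3
```

In an autodiff framework the same scalar would be differentiated with respect to both networks, which is how the removal term reaches the mask predictor.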

Loss & Training¶

The three losses have distinct roles: \(\mathcal{L}_{mask}\) provides global supervision of mask location; \(\mathcal{L}_{pen}\) applies local gradient constraints in the penumbra region; \(\mathcal{L}_{rem}\) provides task-level end-to-end supervision. The training resolution is \(256 \times 256\), with a batch size of 16, using the Adam optimizer. The shadow removal backbone utilizes ShadowDiffusion, with a DDIM sampler and 5 sampling steps during inference. Generating the soft mask GT does not require extra annotations; it is calculated directly using the Y-channel ratio of shadow/shadow-free image pairs in the YCbCr space, followed by low-pass filtering and thresholding.

Key Experimental Results¶

Main Results¶

SRD Dataset (shadow, non-shadow, and whole-image metrics; PSNR in dB):

Method	Mask Source	Shadow PSNR↑	Non-Shadow PSNR↑	Whole-Image PSNR↑	Whole-Image MAE↓
DHAN	DHAN	33.67	34.79	30.51	5.67
ShadowDiffusion	DHAN	38.72	37.78	34.73	3.63
HomoFormer	DHAN	38.81	39.45	35.37	3.33
DeS3	None	37.91	37.45	34.11	3.56
SoftShadow	None	39.08	39.36	35.57	3.11

ISTD+ Dataset (shadow and whole-image metrics; PSNR in dB):

Method	Mask Source	Shadow PSNR↑	Whole-Image PSNR↑	Whole-Image MAE↓
HomoFormer	GT	39.49	35.35	2.64
ShadowDiffusion	FDRNet	40.12	34.08	3.12
HomoFormer	FDRNet	38.84	32.41	3.51
DeS3	None	36.49	31.38	3.94
SoftShadow	None	40.36	35.00	2.85
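
For readers unfamiliar with the region-wise metrics in these tables, the sketch below shows one common way to compute PSNR and MAE over the shadow region, the non-shadow region, or the whole image. It assumes images scaled to [0, 1] and a boolean region mask; the paper's exact evaluation protocol (color space, resolution, MAE scale) is not specified in this summary, so the numbers it produces are not directly comparable to the tables.

```python
import numpy as np

def psnr(pred, target, region=None, max_val=1.0):
    """PSNR in dB, optionally restricted to pixels where `region` is True."""
    sq_err = (np.asarray(pred, dtype=np.float64)
              - np.asarray(target, dtype=np.float64)) ** 2
    if region is not None:
        sq_err = sq_err[region]          # (H, W) bool mask selects (N, C) pixels
    mse = sq_err.mean()
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mae(pred, target, region=None):
    """Mean absolute error, optionally restricted to a region."""
    abs_err = np.abs(np.asarray(pred, dtype=np.float64)
                     - np.asarray(target, dtype=np.float64))
    if region is not None:
        abs_err = abs_err[region]
    return float(abs_err.mean())

# Toy image: large error inside the shadow region, small error outside.
target = np.zeros((4, 4, 3))
shadow = np.zeros((4, 4), dtype=bool)
shadow[:, :2] = True
pred = target.copy()
pred[shadow] += 0.1
pred[~shadow] += 0.01

shadow_psnr = psnr(pred, target, shadow)
non_shadow_psnr = psnr(pred, target, ~shadow)
whole_psnr = psnr(pred, target)
```

The whole-image value falls between the two regional values, which is why a method can lead on whole-image PSNR while trailing slightly in one region.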

Ablation Study¶

Ablation of Stepwise Loss Functions (SRD + LRSS):

Configuration	SRD PSNR↑	SRD MAE↓	LRSS PSNR↑	LRSS MAE↓
Pretrained Weights	31.27	4.41	21.40	12.26
+ \(\mathcal{L}_{rem}\)	35.27	3.32	21.78	12.07
+ \(\mathcal{L}_{rem} + \mathcal{L}_{mask}\)	35.44	3.17	23.08	9.97
+ \(\mathcal{L}_{rem} + \mathcal{L}_{mask} + \mathcal{L}_{pen}\)	35.57	3.11	23.32	9.77
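
The incremental PSNR gain of each added loss can be read off the table above with simple subtraction. The snippet below only reproduces that arithmetic from the reported numbers; the variable names are illustrative.

```python
# (configuration, SRD PSNR, LRSS PSNR) copied from the ablation table.
ablation = [
    ("Pretrained Weights", 31.27, 21.40),
    ("+ L_rem", 35.27, 21.78),
    ("+ L_rem + L_mask", 35.44, 23.08),
    ("+ L_rem + L_mask + L_pen", 35.57, 23.32),
]

# Gain of each row over the row before it, in dB.
gains = [
    (curr[0], round(curr[1] - prev[1], 2), round(curr[2] - prev[2], 2))
    for prev, curr in zip(ablation, ablation[1:])
]

for name, srd_gain, lrss_gain in gains:
    print(f"{name}: SRD +{srd_gain:.2f} dB, LRSS +{lrss_gain:.2f} dB")
```

On SRD the largest step comes from adding \(\mathcal{L}_{rem}\) (+4.00 dB), whereas on LRSS it comes from adding \(\mathcal{L}_{mask}\) (+1.30 dB); \(\mathcal{L}_{pen}\) adds a further 0.13 dB and 0.24 dB respectively.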

Specialized Evaluation of Penumbra Regions (SRD):

Method	Penumbra PSNR↑	Penumbra MAE↓
Inpaint4Shadow	40.10	4.23
DeS3	40.91	4.08
HomoFormer	40.82	3.91
SoftShadow	41.84	3.77

Key Findings¶

Surpasses mask-dependent methods on SRD without external masks: Achieves a whole-image PSNR of 35.57 dB vs. 35.37 dB for HomoFormer (with DHAN masks), a 0.20 dB improvement, although HomoFormer remains marginally ahead on non-shadow PSNR (39.45 vs. 39.36 dB).
Most significant improvement in penumbra regions: Outperforms HomoFormer by 1.02 dB PSNR in the penumbra area, validating the effectiveness of soft masks in penumbra modeling.
Approaches GT-mask-based methods on ISTD+ without using GT masks: 35.00 dB whole-image PSNR vs. 35.35/35.67 dB, whereas methods that rely on detector masks reach only about 32 to 34 dB.
Mask sensitivity analysis indicates that traditional methods are highly reliant on mask quality (PSNR standard deviation of approx. 0.9-1.0 dB), while SoftShadow is unaffected because it takes no external mask as input.
The contribution of each loss depends on the dataset: on SRD the largest gain comes from \(\mathcal{L}_{rem}\) (+4.00 dB), while on LRSS \(\mathcal{L}_{mask}\) contributes the most (+1.30 dB); \(\mathcal{L}_{pen}\) adds a further 0.13 dB on SRD and 0.24 dB on LRSS, targeting edge quality in penumbra regions.
Strong generalization capability: Trained only on SRD, the model outperforms baselines such as DC-ShadowNet on LRSS (23.32 dB) and UIUC (28.85 dB).

Highlights & Insights¶

Redefining the Problem from Hard Mask to Soft Mask: A simple modification that prompts the question "why hasn't anyone done this before?". Expanding \(\{0,1\}\) to \([0,1]\) matches the physical shadow formation model, and its value lies more in the change of formulation than in technical complexity.
Differentiability of Physical Constraints: The physical intuition that "illumination in the penumbra region gradually transitions from the center outwards" is translated into a gradient direction constraint. Using ReLU to select and penalize only the violating components keeps the term differentiable almost everywhere and offers an elegant way of injecting priors.
Task Adaptation of SAM+LoRA: SAM is transformed from a binary segmenter into a continuous soft mask predictor, with the encoder adapted only through LoRA during joint end-to-end training. This approach is both generalizable and parameter-efficient.
GT Soft Mask Generation without Extra Annotation: Directly obtaining ground-truth soft masks via the ratio of Y-channels of shadow/shadow-free image pairs already present in datasets like SRD. This cleverly avoids any need for new annotations.

Limitations & Future Work¶

The advantage is less pronounced on datasets dominated by hard shadows (like ISTD+), where the penumbra region is inherently small, offering limited information gain from soft masks.
The shadow removal backbone utilizes ShadowDiffusion (a diffusion model), whose inference speed is bottlenecked by the number of sampling steps, making it less suitable for real-time applications.
The penumbra constraint assumes shadows have a simple single-center gradient, which may fail in scenes with multiple light sources or complex geometric occlusions.
The generation of GT soft masks depends on a hand-tuned threshold \(t=0.76\); its adaptability to diverse datasets is not thoroughly discussed.
Experiments on self-shadow scenarios are limited; while UIUC shows results, in-depth analysis is lacking.

Comparison with Related Work¶

vs HomoFormer: HomoFormer homogenizes the spatial distribution of shadow masks to unify recovery, but it still relies on binary mask inputs and cannot represent penumbra gradients. SoftShadow not only eliminates the need for mask inputs but also provides richer position information through soft masks.
vs DeS3: DeS3 adopts a mask-free end-to-end approach, removing self-shadows via self-supervised strategies, but lacks explicit shadow position guidance. SoftShadow addresses this drawback through an internal soft mask detector, boosting whole-image PSNR on SRD by 1.46 dB.
vs SAM-helps-shadow: This method feeds binary masks predicted by SAM directly into a shadow removal network, which amounts to a naive pipeline combination. SoftShadow instead adapts SAM into a continuous mask predictor and trains it jointly so that the masks actively assist the shadow removal task, raising PSNR from 30.72 dB to 35.57 dB (a gain of 4.85 dB).
vs ShadowDiffusion: As the backbone of SoftShadow, ShadowDiffusion depends on external masks and is highly sensitive to mask quality. SoftShadow boosts its performance ceiling from 34.73 dB to 35.57 dB through preceding soft mask detection and joint training.

Rating¶

Novelty: ⭐⭐⭐⭐ The soft mask concept is intuitive and highly effective, representing a fundamental improvement over traditional shadow mask representations, though the technical means (SAM+LoRA, physical constraints) are relatively standard.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive experimental design, incorporating four datasets, specialized penumbra evaluation, mask sensitivity analysis, and stepwise ablation studies.
Writing Quality: ⭐⭐⭐⭐ The physical motivation is clearly articulated, with rigorous logic deriving the soft mask from the shadow formation model, and intuitive illustrations.
Value: ⭐⭐⭐⭐ Meaningfully advances the shadow removal field, with engineering value in eliminating the dependency on external masks.