Cross-sensor remote-sensing SR • Semantic-guided flow matching

Semantic-Guided Cross-Sensor Super Resolution of Remote Sensing Images: A Gated Dual Conditioning Flow Matching Model

RareFlow is a gated dual-conditioning flow-matching model for translating 10 m Sentinel-2 imagery into 2 m Maxar-like imagery while reducing unsupported hallucination.

Forouzan Fallah · Wenwen Li · Chia-Yu Hsu · Hyunho Lee · Anna Liljedahl · Yezhou Yang


Overview

RareFlow

semantic + LR control

A semantic-guided, LR-controlled flow-matching framework for cross-sensor remote-sensing SR. It translates Sentinel-2 inputs into Maxar-like outputs while staying grounded in the observed scene.

frozen SD3 · gated ControlNet · consistency loss

Problem: sharp is not enough

Cross-sensor SR is not just sharpening. The output must preserve measured scene layout, recover plausible rare-feature detail, and match the target sensor domain without inventing unsupported structures.

Method: gated dual guidance

RareFlow uses a frozen SD3 backbone, trainable LR-conditioned ControlNet residuals, learned alpha gates, semantic guidance from text, and a consistency-guided objective with frequency, color, and perceptual terms.

Results: best on 6/8 metrics

On paired RTS data, RareFlow leads six of eight reported metrics (FID 116.16, SAM 3.86, SSIM 0.59, LPIPS 0.36, DISTS 0.30, NIQE 5.36) and is preferred by domain experts. The exceptions are PSNR and MANIQA, where SeeSR scores highest.

Data: RTS benchmark

The RTS benchmark pairs 10 m Sentinel-2 imagery with 2 m Maxar targets across Arctic regions. About 800 image pairs come with VLM-generated captions for semantic guidance.

Generalization: BreizhSR + SEN2NAIP

Beyond Arctic thaw slumps, RareFlow transfers to BreizhSR and SEN2NAIP, improving FID and several fidelity / perceptual metrics on both public cross-sensor SR benchmarks.

Problem

Cross-sensor SR must be useful, not just sharp

Remote-sensing SR can create images that look realistic but do not match the true landscape. This is risky for rare geomorphic features such as retrogressive thaw slumps, where a plausible texture can still be scientifically wrong.

RareFlow treats the task as target-domain reconstruction: the output should preserve the LR scene layout, add plausible fine detail, and match the target sensor style.

10 m Sentinel-2 input

2 m Maxar-like output

Rare RTS features

Motivation: (a) low-resolution input, (b) high-resolution ground truth, and (c) a sharp but wrong super-resolved output from a state-of-the-art model. The output in (c) looks sharper and more plausible than the low-resolution image in (a), but it fails to capture the true morphology of the landscape features visible in the high-resolution image in (b).

Abstract

What RareFlow does

High spatial resolution satellite imagery is critical for monitoring fine-scale Earth surface processes, but is often limited by cost and revisit time. This work studies cross-sensor super-resolution (SR) to reduce this gap by translating 10 m Sentinel-2 imagery into 2 m Maxar-like imagery in a data-scarce, domain-shifted setting, with a focus on rare geomorphic features such as retrogressive thaw slumps (RTS). We propose RareFlow, a semantic-guided generative AI framework for cross-sensor super-resolution based on a flow-matching formulation, designed to produce visually plausible and physically reliable high-resolution images.

RareFlow uses dual conditioning to guide the generation process: (1) a gated ControlNet that preserves scene geometry from low-resolution (LR) input, and (2) text-based semantic guidance that injects contextual information when the target phenomenon is rare. To ensure high-fidelity outputs, we introduce a multifaceted loss function that anchors the output to the high-resolution (HR) ground truth by jointly enforcing frequency alignment, perceptual similarity, and color consistency. RareFlow's performance is systematically evaluated on a newly curated benchmark of multi-sensor satellite imagery for rare Earth feature detection, and its generalizability is demonstrated on two public remote sensing benchmarks, SEN2NAIP and BreizhSR. Human evaluation with domain experts is also conducted to further verify RareFlow's effectiveness in generating high-fidelity super-resolved images for scientific analysis.

Comparison: Sentinel-2 LR input vs. RareFlow super-resolved output.

Comparison: HR downsampled to LR scale vs. RareFlow super-resolved output.

Contributions

Key contributions

Gated dual conditioning

RareFlow balances semantic guidance from text with observation guidance from the LR image through learned alpha gates.

Consistency-guided objective

The loss combines flow-matching regression with FFT frequency alignment, CIELAB color consistency, and LPIPS perceptual similarity.

Rare-feature benchmark

The paper builds a multi-sensor RTS benchmark using Sentinel-2 LR inputs and Maxar HR targets under real sensor and time mismatch.

Generalization tests

RareFlow is also evaluated on BreizhSR and SEN2NAIP to test transfer beyond the Arctic RTS setting.
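The consistency-guided objective listed above can be sketched as a weighted sum of its four terms. This is a minimal NumPy illustration, not the authors' implementation: the function names `fft_amplitude_loss` and `total_loss`, the L1 form of the frequency term, and the weights `w_fft`, `w_color`, and `w_lpips` are all assumptions made for the example.

```python
import numpy as np

def fft_amplitude_loss(pred, target):
    """Mean absolute difference between 2-D FFT amplitude spectra (assumed form)."""
    amp_pred = np.abs(np.fft.fft2(pred, axes=(0, 1)))
    amp_tgt = np.abs(np.fft.fft2(target, axes=(0, 1)))
    return float(np.mean(np.abs(amp_pred - amp_tgt)))

def total_loss(l_fm, l_fft, l_color, l_lpips, w_fft=0.1, w_color=0.1, w_lpips=0.1):
    """Weighted sum of the four terms. The weights are placeholders, not the paper's."""
    return l_fm + w_fft * l_fft + w_color * l_color + w_lpips * l_lpips

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 16, 3))                 # toy HR image (H, W, C)
# Toy over-smooth prediction: average each pixel with two shifted copies
blurred = (target + np.roll(target, 1, axis=0) + np.roll(target, 1, axis=1)) / 3.0

l_fft_same = fft_amplitude_loss(target, target)       # identical spectra
l_fft_blur = fft_amplitude_loss(blurred, target)      # blur changes the spectrum

# The CIELAB color term and the LPIPS term need a color-space conversion and a
# pretrained network, so they are passed in here as precomputed numbers.
loss = total_loss(l_fm=0.5, l_fft=l_fft_blur, l_color=1.0, l_lpips=0.4)
```

The frequency term is zero only when the two amplitude spectra match, so an over-smooth prediction is penalized even when its pixel-wise error is small.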

Method

Gated dual-conditioning flow matching

RareFlow keeps the VAE and SD3 MM-DiT backbone frozen. A trainable ControlNet consumes the aligned LR latent and emits residual features for selected backbone blocks. Learned alpha gates scale these residuals before injection, so the model can decide how strongly to trust LR structure at each block.
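The gating step can be illustrated with a short NumPy sketch. It assumes one scalar gate per injected block and plain additive injection; `inject_gated_residuals`, the gate values, and the toy feature shapes are hypothetical and are not taken from the RareFlow code.

```python
import numpy as np

def inject_gated_residuals(backbone_feats, control_residuals, alphas):
    """Add alpha-scaled ControlNet residuals to frozen backbone features.

    backbone_feats / control_residuals: lists of arrays, one per injected block.
    alphas: one scalar gate per block (assumed form; learned during training).
    """
    if not (len(backbone_feats) == len(control_residuals) == len(alphas)):
        raise ValueError("need one residual and one gate per block")
    return [h + a * r for h, r, a in zip(backbone_feats, control_residuals, alphas)]

rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 8)) for _ in range(3)]       # toy (tokens, channels)
residuals = [rng.normal(size=(4, 8)) for _ in range(3)]
alphas = [0.0, 0.5, 1.0]   # 0 ignores the LR residual, 1 applies it in full

gated = inject_gated_residuals(feats, residuals, alphas)
```

A gate near zero leaves the frozen backbone features untouched at that block, while a larger gate lets the LR-derived residual dominate.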

During training, the HR target is used to create the clean latent and compute consistency losses. During inference, the HR branch is removed; RareFlow only uses the LR image and a text prompt to sample the SR output.
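The split between training and inference can be sketched with a toy flow-matching loop. The sketch assumes the straight-line (rectified-flow) path commonly used with SD3, x_t = (1 - t) * x0 + t * noise, and replaces the network with an oracle velocity field; all function names are hypothetical.

```python
import numpy as np

def fm_training_pair(x0, noise, t):
    """Noisy latent and velocity target on the path x_t = (1 - t) * x0 + t * noise."""
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0          # time derivative of x_t along the straight path
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """Mean-squared flow-matching regression loss."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, noise, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (clean latent)."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))      # stand-in for the clean HR latent (training only)
noise = rng.normal(size=(4, 4))

x_t, v_target = fm_training_pair(x0, noise, t=0.3)
loss_at_optimum = fm_loss(v_target, v_target)   # a perfect predictor gives zero loss

# Oracle velocity for this single pair. A real model would predict it from x_t, t,
# the LR-conditioned ControlNet residuals, and the text embedding.
x_rec = euler_sample(lambda x, t: noise - x0, noise)
```

The HR latent `x0` appears only when the training pair is built; sampling starts from noise and needs only the velocity predictor, which mirrors the removal of the HR branch at inference.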

Framework overview: LR latents and semantic guidance steer a frozen SD3 backbone through a trainable ControlNet and alpha-gate mechanism. The figure shows the input data, gated ControlNet, frozen SD3 backbone, and consistency-guided objective.

Technical architecture: learned gates scale ControlNet residuals before they are added to frozen backbone features. The figure shows the dual conditioning, SD3 MM-DiT blocks, and ControlNet MM-DiT blocks.

Frequency motivation: a comparison of LR and HR images and their FFT spectra shows that HR targets contain stronger mid- and high-frequency content, motivating the FFT alignment loss.
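The frequency gap between LR-like and HR images can be checked numerically. The sketch below measures the share of spectral energy above a radial cutoff for a toy texture and for a 5x block-averaged copy of it; the function `high_freq_fraction`, the cutoff of 0.25 cycles per pixel, and the synthetic data are assumptions for illustration, not the paper's analysis.

```python
import numpy as np

def high_freq_fraction(img, cutoff=0.25):
    """Fraction of spectral energy above a radial `cutoff` in cycles per pixel."""
    power = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(power[radius > cutoff].sum() / power.sum())

rng = np.random.default_rng(0)
hr = rng.normal(size=(40, 40))                       # toy "HR" texture
# Toy "LR-like" image: 5x5 block average, then nearest-neighbour upsampling
lr = hr.reshape(8, 5, 8, 5).mean(axis=(1, 3))
lr_up = np.repeat(np.repeat(lr, 5, axis=0), 5, axis=1)

hr_share = high_freq_fraction(hr)
lr_share = high_freq_fraction(lr_up)
```

The factor of 5 matches the 10 m to 2 m resolution ratio; on real imagery the same measurement would be applied to co-registered Sentinel-2 and Maxar patches.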

Main results

RareFlow improves realism while keeping structure

FID 116.16: best; 37.94% lower than the best baseline (AdcSR, 187.18).

SAM 3.86: best; stronger spectral consistency.

SSIM 0.59: best on paired LR-HR data.

LPIPS 0.36 / DISTS 0.30: best perceptual similarity.
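For readers unfamiliar with SAM (spectral angle mapper), the metric is the angle between the predicted and reference spectral vectors at each pixel, averaged over the image. A minimal sketch follows; reporting in degrees is an assumption here, and `spectral_angle_deg` is a hypothetical helper rather than the evaluation code behind the reported numbers.

```python
import numpy as np

def spectral_angle_deg(pred, target, eps=1e-8):
    """Mean per-pixel spectral angle in degrees for (H, W, bands) images."""
    dot = np.sum(pred * target, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(2, 2, 3))

# Scaling every band by the same factor changes brightness, not spectral shape
angle_scaled = spectral_angle_deg(2.0 * target, target)

# Orthogonal spectra are as different in shape as two vectors can be
red = np.array([[[1.0, 0.0, 0.0]]])
green = np.array([[[0.0, 1.0, 0.0]]])
angle_orthogonal = spectral_angle_deg(red, green)
```

Because the angle ignores overall brightness, a low SAM indicates that band ratios are preserved even when intensity differs between sensors.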

Model	PSNR ↑	SSIM ↑	SAM ↓	LPIPS ↓	DISTS ↓	FID ↓	NIQE ↓	MANIQA ↑
ZoomLDM	17.23	0.26	12.96	0.60	0.59	352.11	18.10	0.19
SeeSR	18.78	0.50	12.26	0.46	0.38	302.36	10.78	0.36
AdcSR	18.59	0.58	12.31	0.40	0.37	187.18	8.38	0.28
MISR-S2	18.39	0.50	12.72	0.54	0.43	254.70	13.55	0.33
SAMSR	18.36	0.54	12.80	0.48	0.39	189.01	11.84	0.32
OpenSR	17.29	0.51	12.59	0.41	0.36	225.62	9.80	0.25
RareFlow	18.76	0.59	3.86	0.36	0.30	116.16	5.36	0.31
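The headline claims can be re-derived from the table. The short script below copies the table values, finds the best model per metric, and computes the FID reduction relative to the best baseline; the variable and function names are illustrative only.

```python
# Values copied from the results table; True means higher is better.
higher_is_better = {"PSNR": True, "SSIM": True, "SAM": False, "LPIPS": False,
                    "DISTS": False, "FID": False, "NIQE": False, "MANIQA": True}
metrics = list(higher_is_better)
rows = {
    "ZoomLDM":  [17.23, 0.26, 12.96, 0.60, 0.59, 352.11, 18.10, 0.19],
    "SeeSR":    [18.78, 0.50, 12.26, 0.46, 0.38, 302.36, 10.78, 0.36],
    "AdcSR":    [18.59, 0.58, 12.31, 0.40, 0.37, 187.18, 8.38, 0.28],
    "MISR-S2":  [18.39, 0.50, 12.72, 0.54, 0.43, 254.70, 13.55, 0.33],
    "SAMSR":    [18.36, 0.54, 12.80, 0.48, 0.39, 189.01, 11.84, 0.32],
    "OpenSR":   [17.29, 0.51, 12.59, 0.41, 0.36, 225.62, 9.80, 0.25],
    "RareFlow": [18.76, 0.59, 3.86, 0.36, 0.30, 116.16, 5.36, 0.31],
}

def best_model(metric):
    """Name of the model with the best value for `metric`."""
    idx = metrics.index(metric)
    pick = max if higher_is_better[metric] else min
    return pick(rows, key=lambda name: rows[name][idx])

rareflow_wins = [m for m in metrics if best_model(m) == "RareFlow"]

fid = {name: vals[metrics.index("FID")] for name, vals in rows.items()}
best_baseline_fid = min(v for name, v in fid.items() if name != "RareFlow")
fid_reduction_pct = 100 * (best_baseline_fid - fid["RareFlow"]) / best_baseline_fid

print(rareflow_wins, round(fid_reduction_pct, 2))
```

The two metrics RareFlow does not lead are PSNR, where it trails SeeSR by 0.02 dB, and MANIQA.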

Qualitative comparison across LR, HR, baselines, and RareFlow: RareFlow better matches Maxar-like detail and style than the listed baselines.

Ablation

The full model combines semantic guidance, consistency-guided training, and alpha-gated structural conditioning. Among the ablated variants, this combination gives the strongest overall result: it achieves the best SSIM, FID, and NIQE, and is second-best or tied for second-best on SAM, LPIPS, DISTS, and MANIQA. These results suggest that the components are complementary: the pre-trained ControlNet supports structural consistency, captions improve target-domain realism, the consistency-guided objective stabilizes caption-guided generation, and the alpha gate helps balance conditioning strength.

Data

RTS benchmark and data challenges

The benchmark pairs Sentinel-2 Level-1C imagery with Maxar imagery to learn a 10 m to 2 m cross-sensor mapping for retrogressive thaw slump regions across Arctic sites.

The setting is hard because LR-HR pairs can be spatially shifted, temporally mismatched, cloud-affected, very small, and limited in number. The paper reports roughly 800 training images.

Spatial mismatch

Temporal mismatch

Cloud occlusion

Limited data

Human evaluation

Domain experts preferred RareFlow over baselines


Experts scored outputs from 1 to 10 with attention to RTS boundaries. RareFlow reached a mean score of 4.6, above AdcSR at 3.2, SeeSR at 2.5, and the LR input at 2.3. The HR reference remained the upper bound at 6.3.

Reported agreement was substantial: Krippendorff's alpha = 0.736, Kendall's W = 0.841, and Spearman's rho = 0.673.
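As a reminder of what one of these statistics measures, Kendall's W compares the spread of per-item rank sums against the spread expected under perfect agreement. The sketch below implements the standard formula without tie correction; the example rankings are made up and are not the study's ratings.

```python
def kendalls_w(rankings):
    """Kendall's W for m raters ranking n items (no tie correction).

    rankings: list of m lists, each a permutation of the ranks 1..n.
    """
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings of five outputs by three raters
w_perfect = kendalls_w([[1, 2, 3, 4, 5]] * 3)        # identical rankings
w_opposed = kendalls_w([[1, 2, 3], [3, 2, 1]])       # two raters fully disagree
```

W ranges from 0 (no agreement) to 1 (identical rankings), so the reported 0.841 indicates that experts ordered the methods in largely the same way.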

Expert evaluation bar chart with scores for SeeSR, LR, AdcSR, RareFlow, and HR: RareFlow narrows the gap between the LR input and the HR reference.

Generalization

Tests beyond the RTS benchmark

RareFlow was trained and tested on two public cross-sensor SR benchmarks to evaluate transfer beyond Arctic thaw slump imagery.

Dataset	Method	PSNR ↑	SSIM ↑	FSIM ↑	LPIPS ↓	DISTS ↓	FID ↓	NIQE ↓	MANIQA ↑
BreizhSR	RareFlow	11.30	0.18	0.53	0.52	0.31	245.30	7.95	0.32
BreizhSR	MISR-S2	11.28	0.23	0.54	0.63	0.36	254.10	9.36	0.22
SEN2NAIP	RareFlow	14.44	0.29	0.57	0.59	0.32	214.50	6.23	0.19
SEN2NAIP	OpenSR	12.50	0.24	0.61	0.61	0.36	230.95	6.76	0.20
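A per-metric tally makes the table easier to read. The script below copies the table values and lists, for each dataset, the metrics on which RareFlow beats the listed baseline; the names used are illustrative only.

```python
metrics = ["PSNR", "SSIM", "FSIM", "LPIPS", "DISTS", "FID", "NIQE", "MANIQA"]
higher_is_better = [True, True, True, False, False, False, False, True]

# Values copied from the generalization table
table = {
    "BreizhSR": {"RareFlow": [11.30, 0.18, 0.53, 0.52, 0.31, 245.30, 7.95, 0.32],
                 "MISR-S2":  [11.28, 0.23, 0.54, 0.63, 0.36, 254.10, 9.36, 0.22]},
    "SEN2NAIP": {"RareFlow": [14.44, 0.29, 0.57, 0.59, 0.32, 214.50, 6.23, 0.19],
                 "OpenSR":   [12.50, 0.24, 0.61, 0.61, 0.36, 230.95, 6.76, 0.20]},
}

def rareflow_wins(dataset):
    """Metrics on which RareFlow strictly beats the dataset's baseline."""
    methods = table[dataset]
    ours = methods["RareFlow"]
    (baseline,) = [vals for name, vals in methods.items() if name != "RareFlow"]
    return [m for m, up, a, b in zip(metrics, higher_is_better, ours, baseline)
            if (a > b if up else a < b)]

wins = {dataset: rareflow_wins(dataset) for dataset in table}
print(wins)
```

RareFlow leads on six of eight metrics on each benchmark. The baselines keep an edge on SSIM and FSIM for BreizhSR and on FSIM and MANIQA for SEN2NAIP.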

Citation

@article{fallah2025rareflow,
  title={RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features},
  author={Fallah, Forouzan and Li, Wenwen and Hsu, Chia-Yu and Lee, Hyunho and Yang, Yezhou},
  journal={arXiv preprint arXiv:2510.23816},
  year={2025}
}