
microsoft/Swin-Transformer

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

Overall: Mixed

Stale — last commit 2y ago

Use as dependency — Mixed (weakest axis)

last commit was 2y ago; no tests detected…

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Mixed

last commit was 2y ago; no CI workflows detected

  • 14 active contributors
  • Distributed ownership (top contributor 40% of recent commits)
  • MIT licensed
  • Stale — last commit 2y ago
  • No CI workflows detected
  • No test directory detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: 1 commit in the last 365 days; add a test suite
  • Deploy as-is: Mixed → Healthy if: 1 commit in the last 180 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

Variant: RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/microsoft/swin-transformer?axis=fork)](https://repopilot.app/r/microsoft/swin-transformer)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/microsoft/swin-transformer on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: microsoft/Swin-Transformer

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/microsoft/Swin-Transformer shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 2y ago

  • 14 active contributors
  • Distributed ownership (top contributor 40% of recent commits)
  • MIT licensed
  • ⚠ Stale — last commit 2y ago
  • ⚠ No CI workflows detected
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live microsoft/Swin-Transformer repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/microsoft/Swin-Transformer.

What it runs against: a local clone of microsoft/Swin-Transformer — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in microsoft/Swin-Transformer | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 681 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>microsoft/Swin-Transformer</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of microsoft/Swin-Transformer. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/microsoft/Swin-Transformer.git
#   cd Swin-Transformer
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of microsoft/Swin-Transformer and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "microsoft/Swin-Transformer(\.git)?\b" \
  && ok "origin remote is microsoft/Swin-Transformer" \
  || miss "origin remote is not microsoft/Swin-Transformer (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "models/swin_transformer.py" \
  && ok "models/swin_transformer.py" \
  || miss "missing critical file: models/swin_transformer.py"
test -f "main.py" \
  && ok "main.py" \
  || miss "missing critical file: main.py"
test -f "models/build.py" \
  && ok "models/build.py" \
  || miss "missing critical file: models/build.py"
test -f "data/build.py" \
  && ok "data/build.py" \
  || miss "missing critical file: data/build.py"
test -f "config.py" \
  && ok "config.py" \
  || miss "missing critical file: config.py"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 681 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~651d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/microsoft/Swin-Transformer"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Official implementation of Swin Transformer, a hierarchical vision transformer architecture using shifted windows that achieves state-of-the-art results on image classification, object detection, instance segmentation, and semantic segmentation. The core innovation is replacing the standard global self-attention in vision transformers with local window-based attention that shifts between layers, dramatically reducing computational complexity while maintaining accuracy. Monolithic repo structure: core model implementations likely in a models/ directory, 60+ YAML config files organized by task (configs/swin/, configs/swinv2/, configs/simmim/, configs/swinmoe/, configs/swinmlp/), and central config.py for hyperparameter management. Supports multiple training paradigms: supervised learning, fine-tuning from ImageNet-22k, masked image modeling (SimMIM), and mixture-of-experts routing, each with dedicated config folders.

👥Who it's for

Computer vision researchers and practitioners building image classification models, object detection systems, or semantic segmentation pipelines who need a production-ready transformer backbone with strong ImageNet/COCO benchmarks. Model developers performing transfer learning from pre-trained Swin checkpoints on downstream vision tasks.

🌱Maturity & risk

Highly mature and production-ready. This is an official Microsoft research implementation with extensive model variants (Tiny/Small/Base/Large, v1/v2), SOTA benchmarks on COCO and ADE20K, and 2+ years of active development. Well-established with reference implementations across multiple vision tasks via companion repos. Clear maintenance cadence with recent feature additions (SwinV2, MoE variants, SimMIM support) as of late 2022.

Standard open source risks apply.

Active areas of work

Active development on model variants: Swin Transformer V2 (with scaled cosine attention and other scaling-oriented changes), Swin-MoE (Mixture of Experts with 8/16/32/64 expert configurations), SwinMLP (attention-free alternative), and the SimMIM pre-training framework. NVIDIA FasterTransformer integration announced for optimized inference on T4/A100 GPUs. Latest commits (late 2022) focus on scaling and efficiency improvements.

🚀Get running

git clone https://github.com/microsoft/Swin-Transformer && cd Swin-Transformer && pip install -r requirements.txt (inferred—check repo root for requirements.txt/setup.py). Then review get_started.md for dataset setup (ImageNet-1K or ImageNet-22K) and config selection from configs/swin/ directory.

Daily commands (expected flow, from README context): (1) prepare the ImageNet dataset; (2) select a config from configs/swin/*.yaml (e.g., swin_tiny_patch4_window7_224.yaml); (3) run python -m torch.distributed.launch --nproc_per_node=<gpu_count> main.py --cfg <config_path> --data-path <imagenet_path> --batch-size <bs>. For fine-tuning: use a 22kto1k_finetune.yaml variant and resume from a checkpoint. See get_started.md for exact commands.

🗺️Map of the codebase

  • models/swin_transformer.py — Core implementation of Swin Transformer architecture with window-based multi-head self-attention; essential for understanding the hierarchical vision transformer design.
  • main.py — Primary training and evaluation entry point; defines the full pipeline for model training, validation, and distributed setup.
  • models/build.py — Model factory that instantiates Swin variants (base, small, tiny, large) from config; required to understand model creation and parameter initialization.
  • data/build.py — Data pipeline builder for ImageNet and ImageNet-22K; critical for understanding dataset loading, augmentation, and sampling strategies.
  • config.py — Centralized configuration management using YACS; every training run depends on understanding how configs are parsed and applied.
  • models/swin_transformer_v2.py — V2 enhancements including residual post-norm and continuous relative position bias; shows architectural evolution and modern improvements.
  • optimizer.py — Optimizer and learning rate scheduler configuration with layer-wise decay; critical for reproducing published results.

🛠️How to make changes

Add a new Swin model variant

  1. Create YAML config in configs/swin/ with model name, depth, num_heads, embed_dim, and window_size parameters (configs/swin/swin_custom_patch4_window7_224.yaml)
  2. Update models/build.py get_model() factory to recognize the new config name and instantiate properly (models/build.py)
  3. Optionally update config.py if introducing new hyperparameters specific to the variant (config.py)
  4. Run main.py with the new config: python main.py --cfg configs/swin/swin_custom_patch4_window7_224.yaml (main.py)
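A minimal sketch of steps 1–2, assuming the factory dispatches on config.MODEL.TYPE as the map above suggests; the function name (described above as get_model()) and the config field names are assumptions to verify against models/build.py and config.py before copying:

```python
# Hypothetical sketch of the variant branch in models/build.py; verify the real
# factory name and config keys (MODEL.TYPE, MODEL.SWIN.*) against the source.
from .swin_transformer import SwinTransformer

def build_model(config):
    model_type = config.MODEL.TYPE                 # set by the YAML's MODEL.TYPE field
    common = dict(
        img_size=config.DATA.IMG_SIZE,
        num_classes=config.MODEL.NUM_CLASSES,
        embed_dim=config.MODEL.SWIN.EMBED_DIM,
        depths=config.MODEL.SWIN.DEPTHS,
        num_heads=config.MODEL.SWIN.NUM_HEADS,
        window_size=config.MODEL.SWIN.WINDOW_SIZE,
    )
    if model_type == "swin":
        return SwinTransformer(**common)
    if model_type == "swin_custom":                # new branch for the custom variant
        return SwinTransformer(drop_path_rate=0.3, **common)  # illustrative override only
    raise NotImplementedError(f"Unknown model type: {model_type}")
```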

Add support for a new dataset

  1. Create a new dataset class in data/ inheriting from torch.utils.data.Dataset with __getitem__ and __len__ (data/cached_image_folder.py)
  2. Add dataset instantiation logic to data/build.py build_dataset() function with appropriate transforms (data/build.py)
  3. Add corresponding config entries in config.py for dataset_name, data_path, num_classes, img_size (config.py)
  4. Update data samplers in data/samplers.py if custom sampling strategy is needed for the dataset (data/samplers.py)
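A minimal sketch of steps 1–2, assuming data/build.py exposes a build_dataset(is_train, config) entry point as described above; the class name, config keys, and the build_transform helper are hypothetical placeholders:

```python
# Hedged sketch of a custom dataset plus the data/build.py hookup described above.
import os
from PIL import Image
from torch.utils.data import Dataset

class MyImageDataset(Dataset):
    """Folder-per-class layout: root/<class_name>/<image>."""

    def __init__(self, root, transform=None):
        self.samples = []                              # list of (path, label) pairs
        self.transform = transform
        for label, cls in enumerate(sorted(os.listdir(root))):
            cls_dir = os.path.join(root, cls)
            for name in sorted(os.listdir(cls_dir)):
                self.samples.append((os.path.join(cls_dir, name), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = Image.open(path).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# Hypothetical branch to add in data/build.py:
# def build_dataset(is_train, config):
#     if config.DATA.DATASET == "my_dataset":
#         transform = build_transform(is_train, config)
#         return MyImageDataset(config.DATA.DATA_PATH, transform), config.MODEL.NUM_CLASSES
```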

Integrate a new training objective (pretraining method)

  1. Create main_newmethod.py copying structure from main_simmim_pt.py with custom loss computation (main_simmim_pt.py)
  2. Create wrapper model in models/ (e.g., models/newmethod.py) that adds task-specific heads to Swin (models/simmim.py)
  3. Add pretraining utils in utils_newmethod.py for loss calculation and metric tracking (utils_simmim.py)
  4. Create config files in configs/newmethod/ with pretraining and finetuning configurations (configs/simmim/simmim_pretrain__swin_base__img192_window6__800ep.yaml)
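A compact sketch of step 2, loosely modeled on the wrapper pattern that models/simmim.py is described as using; the encoder attributes assumed here (forward_features, num_features) should be checked against models/swin_transformer.py:

```python
# Hedged sketch of a wrapper model for a new pretraining objective; the loss is a
# placeholder, and the backbone API is an assumption to verify in the repo.
import torch
import torch.nn as nn

class SwinForNewMethod(nn.Module):
    def __init__(self, encoder, num_outputs):
        super().__init__()
        self.encoder = encoder                                  # a SwinTransformer backbone
        self.head = nn.Linear(encoder.num_features, num_outputs)

    def forward(self, x, target):
        feats = self.encoder.forward_features(x)                # pooled backbone features
        pred = self.head(feats)
        loss = nn.functional.mse_loss(pred, target)             # task-specific objective
        return loss, pred
```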

Tune learning rate and optimization for a new task

  1. Copy an existing config and modify lr, warmup_epochs, weight_decay, and layer_decay parameters (configs/swin/swin_base_patch4_window7_224.yaml)
  2. Review and customize layer-wise LR decay schedule in optimizer.py for your model depth (optimizer.py)
  3. Adjust scheduler parameters in lr_scheduler.py (e.g., cosine annealing epochs, linear warmup steps) (lr_scheduler.py)
  4. Run training with new config and monitor loss curves via logger output (main.py)
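To make the layer-wise decay idea concrete, here is a generic sketch of building parameter groups with geometrically decayed learning rates; it illustrates the concept behind optimizer.py rather than reproducing its code, and the get_layer_id mapping is a placeholder the caller must supply:

```python
# Generic layer-wise LR decay: deeper layers keep the base LR, earlier layers are
# scaled down by layer_decay per layer. Not the repo's exact implementation.
def param_groups_with_layer_decay(model, base_lr, layer_decay, num_layers, get_layer_id):
    """get_layer_id maps a parameter name to an integer depth in [0, num_layers]."""
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        layer_id = get_layer_id(name)
        scale = layer_decay ** (num_layers - layer_id)
        if layer_id not in groups:
            groups[layer_id] = {"params": [], "lr": base_lr * scale}
        groups[layer_id]["params"].append(param)
    return list(groups.values())

# Usage sketch:
# optimizer = torch.optim.AdamW(
#     param_groups_with_layer_decay(model, 1e-3, 0.9, 12, my_layer_id_fn),
#     weight_decay=0.05)
```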

🔧Why these technologies

  • PyTorch — Industry-standard deep learning framework with strong distributed training (DDP) and mixed precision (AMP) support required for large-scale vision model training.
  • YACS for configuration — Centralized, hierarchical config management allows reproducible experiments across multiple model variants and datasets without code changes.
  • CUDA kernels for window operations — Custom kernels optimize the critical window partitioning and attention operations that are central to Swin's efficiency gains over dense attention.
  • Distributed Data Parallel (DDP) — Multi-GPU/multi-node training is essential for pretraining on ImageNet-22K (14M images); DDP provides minimal overhead and deterministic gradients.
  • Automatic Mixed Precision (AMP) — Reduces memory footprint and training time by ~2x while maintaining accuracy; critical for training large models on limited hardware.
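As a concrete illustration of the DDP and AMP bullets above, a generic PyTorch training-step skeleton (not copied from main.py) looks roughly like this:

```python
# Minimal DDP + AMP training-step skeleton; a generic PyTorch pattern, shown only
# to illustrate why these two technologies appear together in this repo.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model, images, targets, criterion, optimizer, scaler):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # mixed-precision forward pass
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()              # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Per-process setup (under torchrun / torch.distributed.launch):
# dist.init_process_group(backend="nccl")
# model = DDP(model.cuda(), device_ids=[local_rank])
# scaler = torch.cuda.amp.GradScaler()
```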

⚖️Trade-offs already made

  • Shifted window attention instead of dense global attention

    • Why: Reduces computational complexity from O(N²) to O(N·W²) where W is window size, enabling efficient training on high-resolution images.
    • Consequence: Requires careful window padding/shifting logic and custom CUDA kernels; adds code complexity but provides 2-4x speedup vs ViT on same hardware.
  • Hierarchical stages with patch merging

    • Why: Produces multi-scale feature maps suitable for dense prediction tasks (detection, segmentation) and allows efficient feature reuse across stages.
    • Consequence: Model architecture is more complex than flat ViT; requires task-specific decoder heads, but enables single backbone for diverse tasks.

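To make the O(N²) → O(N·W²) point concrete: with window attention each token attends only to the window_size² tokens in its own window. The partition/reverse helpers below are close to what models/swin_transformer.py implements, but they are reproduced from memory here, so treat them as illustrative and check the source:

```python
# Window partitioning: (B, H, W, C) feature maps are cut into window_size x
# window_size tiles so attention runs per tile instead of over all H*W tokens.
import torch

def window_partition(x, window_size):
    """(B, H, W, C) -> (num_windows*B, window_size, window_size, C)"""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
```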

🪤Traps & gotchas

  1. The window size (e.g., window7, window12) is fixed at training time — changing it at inference breaks the positional encoding.
  2. ImageNet-22K pre-training checkpoints require the dedicated fine-tuning configs (configs/swin/*22kto1k_finetune.yaml); naive loading will fail due to layer mismatch.
  3. Multi-GPU training via torch.distributed requires the MASTER_ADDR/MASTER_PORT env vars to be set correctly (see the sketch below).
  4. The custom window-process ops require CUDA compilation; a CPU-only mode is possible but much slower.
  5. The config naming convention (swin_<size>_patch<p>_window<w>) is strict; typos in config.py YAML matching will silently fall back to defaults.
  6. SimMIM pre-training uses a different masking strategy than standard MAE; configs in configs/simmim/ are not interchangeable with supervised configs.
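For trap 3, a minimal sketch of the environment-based rendezvous that torch.distributed.launch/torchrun rely on; the values here are illustrative defaults, not repo code:

```python
# torch.distributed's env:// rendezvous reads MASTER_ADDR / MASTER_PORT (plus
# RANK / WORLD_SIZE, which the launcher normally exports for each process).
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # rendezvous host
os.environ.setdefault("MASTER_PORT", "29500")       # any free TCP port
dist.init_process_group(
    backend="nccl",                                  # use "gloo" on CPU-only machines
    init_method="env://",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)
```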

🏗️Architecture

💡Concepts to learn

  • Shifted Window Attention — The core innovation of Swin Transformer—partitions feature maps into non-overlapping windows and applies local self-attention with periodic shifts between layers to enable cross-window interaction without quadratic complexity. Understanding this is essential to grasp why Swin outperforms dense-attention ViT.
  • Hierarchical multi-stage architecture — Unlike ViT which maintains fixed resolution, Swin progressively reduces spatial dimensions across stages (4x→8x→16x→32x downsampling) while increasing channel depth. This design enables efficient multi-scale feature extraction critical for detection and segmentation tasks.
  • Relative position bias — Swin uses learnable relative position biases instead of absolute position embeddings; enables flexibility in handling variable input resolutions (configs show 192, 224, 256, 384 variants). Position interpolation techniques used in SwinV2 are non-obvious and critical for transfer learning.
  • Mixture of Experts (MoE) routing — SwinMoE configs (configs/swinmoe/) replace dense FFN layers with sparse expert networks; each token routes to a subset of experts, enabling much larger parameter counts without a proportional compute increase. Essential for understanding parameter efficiency vs. FLOPs trade-offs.
  • Masked Image Modeling (SimMIM) — SimMIM pre-training (configs/simmim/) randomly masks image patches and reconstructs pixel values; fundamentally different from supervised ImageNet-22K pre-training. Configs show this achieves competitive downstream performance with different convergence properties.
  • Cyclic shift for efficiency — Swin avoids expensive padding by using cyclic shifts (torch.roll) and then computing an attention mask; this avoids creating extra padded tokens but requires careful mask computation. This implementation detail, visible in the window attention code, affects both speed and memory usage; a small sketch follows this list.
  • Transfer from ImageNet-22K pre-training — the 22kto1k_finetune configs transfer models pre-trained on ImageNet-22K (14M images) to ImageNet-1K (1.3M images) fine-tuning; this involves layer dropout, learning-rate scheduling, and regularization choices that are not obvious from standard supervised fine-tuning.
  • facebookresearch/vision_transformer — Original ViT (Vision Transformer) implementation that inspired Swin; understanding ViT's global attention mechanism clarifies why shifted windows are needed
  • SwinTransformer/Swin-Transformer-Object-Detection — Official companion repo using Swin backbone for COCO object detection and instance segmentation; essential for understanding how Swin scales to detection tasks
  • SwinTransformer/Video-Swin-Transformer — Extends Swin to video action recognition on Kinetics-400; shows how 3D shifted windows adapt the core attention mechanism for temporal modeling
  • microsoft/Transformer-SSL — Self-supervised learning framework (contrastive) compatible with Swin; demonstrates pre-training alternatives to ImageNet-22K supervised learning
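The sketch below illustrates the cyclic-shift trick from the concepts above (torch.roll plus an attention mask); it is simplified relative to models/swin_transformer.py, and the mask construction is only described in comments:

```python
# Cyclic shift keeps shifted windows the same size as regular windows; the real
# model additionally builds an additive attention mask so tokens that are only
# adjacent because of the wrap-around cannot attend to each other.
import torch

def shifted_window_tokens(x, window_size, shift_size):
    """x: (B, H, W, C). Returns (num_windows*B, window_size*window_size, C)."""
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    B, H, W, C = shifted.shape
    windows = shifted.view(B, H // window_size, window_size,
                           W // window_size, window_size, C)
    windows = windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows
```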

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for kernels/window_process module

The window_process kernel is a critical CUDA/C++ component of the shifted window attention mechanism (core to Swin Transformer), but it has only a single unit_test.py file and no other visible test coverage in the repo. New contributors could expand this with pytest-based tests covering edge cases, different batch sizes, window configurations, and CUDA/CPU fallbacks to ensure robustness across deployment scenarios; a starting sketch follows the checklist below.

  • [ ] Create tests/test_window_process.py with comprehensive test cases
  • [ ] Add tests for different window sizes (7, 8, 12, 14, 16, 24) matching configs/swin* and configs/swinv2*
  • [ ] Test edge cases: batch_size=1, misaligned image dimensions, non-square windows
  • [ ] Add GPU/CPU device agnostic tests using pytest fixtures
  • [ ] Integrate with kernels/window_process/unit_test.py and document in get_started.md
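A hedged starting point for tests/test_window_process.py: it round-trips a pure-PyTorch roll/partition reference and leaves a marked spot for comparing the fused CUDA op, whose exact Python entry point should be taken from kernels/window_process/unit_test.py rather than assumed:

```python
# Hypothetical tests/test_window_process.py sketch (not existing repo code).
import pytest
import torch

def roll_and_partition(x, window_size, shift):
    B, H, W, C = x.shape
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def unpartition_and_unroll(windows, window_size, shift, H, W):
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return torch.roll(x, shifts=(shift, shift), dims=(1, 2))

@pytest.mark.parametrize("window_size", [7, 8, 12])
@pytest.mark.parametrize("batch_size", [1, 4])
def test_partition_roundtrip(window_size, batch_size):
    H = W = window_size * 8
    x = torch.randn(batch_size, H, W, 32)
    windows = roll_and_partition(x, window_size, shift=window_size // 2)
    restored = unpartition_and_unroll(windows, window_size, window_size // 2, H, W)
    torch.testing.assert_close(restored, x)
    # TODO: add a CUDA-only test comparing the fused kernel's output to `windows`.
```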

Add data loading unit tests for configs validation

The repo has 40+ YAML config files across swin/, swinv2/, simmim/, swinmlp/, and swinmoe/ subdirectories, but no tests validating that config.py correctly parses these configs or that data/build.py properly instantiates dataloaders. New contributors could create tests ensuring each config loads without errors and produces valid data shapes; a minimal sketch follows the checklist below.

  • [ ] Create tests/test_config_loading.py to validate all YAML configs in configs/ parse correctly
  • [ ] Add tests/test_dataloader.py to verify data/build.py, data/cached_image_folder.py, and data/imagenet22k_dataset.py work with sample configs
  • [ ] Test data/data_simmim_pt.py and data/data_simmim_ft.py with mock datasets
  • [ ] Verify sampler instantiation in data/samplers.py handles distributed scenarios
  • [ ] Document expected data directory structure in get_started.md
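A minimal sketch for the config-parsing half of this idea; it only checks that each YAML parses with yaml.safe_load and declares a MODEL block (an assumption to confirm against the configs), leaving the dataloader checks to the steps above:

```python
# Hypothetical tests/test_config_loading.py sketch (not existing repo code).
import pathlib
import pytest
import yaml

CONFIG_DIR = pathlib.Path(__file__).resolve().parents[1] / "configs"
CONFIG_FILES = sorted(CONFIG_DIR.rglob("*.yaml"))

@pytest.mark.parametrize("cfg_path", CONFIG_FILES, ids=lambda p: p.name)
def test_yaml_parses_and_has_model_section(cfg_path):
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)          # safe_load: no arbitrary object construction
    assert isinstance(cfg, dict) and cfg, f"{cfg_path} is empty or not a mapping"
    assert "MODEL" in cfg, f"{cfg_path} has no MODEL section"
```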

Add CI workflow for kernel compilation and model inference tests

While the repo has kernels/window_process with CUDA code, there's no visible GitHub Actions workflow testing that the kernel compiles correctly or that end-to-end inference works. New contributors could add a GitHub Actions workflow that compiles the kernel, runs unit tests, and validates inference on sample configs (swin_tiny, swinv2_tiny) to catch regressions.

  • [ ] Create .github/workflows/kernel_build_test.yml to test kernels/window_process compilation
  • [ ] Add CUDA 11.x and CPU-only build variants in the workflow
  • [ ] Run kernels/window_process/unit_test.py in the workflow
  • [ ] Create .github/workflows/model_inference_test.yml testing inference with configs/swin/swin_tiny_patch4_window7_224.yaml
  • [ ] Test with pre-trained weights from MODELHUB.md (or mock weights for speed)
  • [ ] Document setup requirements in CONTRIBUTING.md or README.md

🌿Good first issues

  • Add comprehensive unit tests for window attention mechanism in models/: currently no visible test files for shifted window partitioning logic, which is the core innovation. Create tests/test_window_attention.py validating partition/merge operations and attention mask correctness.
  • Document config parameter breakdown for each YAML file: no inline documentation in YAML files explaining what each hyperparameter does (e.g., why swin_base uses window_size=7 while swinv2_base uses window_size=12). Create configs/README.md with parameter glossary and comparison table across variants.
  • Add missing integration tests for model variants: verify that all 60+ configs in configs/ can actually load and forward-pass without errors. Create a test_config_loading.py script that iterates configs/, instantiates models, and runs dummy inputs; many variant configs may have stale/incompatible parameters.

Top contributors


📝Recent commits

  • f82860b — Merge pull request #362 from zdaxie/main (impiga)
  • 8759d78 — update simmim pretrained swin v2 model paths (zdaxie)
  • a42ea9e — Merge branch 'microsoft:main' into main (zdaxie)
  • 968e6b5 — supporting pytorch 2.x (#346) (zeliu98)
  • 2cb103f — update azure paths of SimMIM ckpts (#334) (zdaxie)
  • 5758779 — update azure paths of SimMIM ckpts (zdaxie)
  • f92123a — Update README.md (ancientmooner)
  • ad1c947 — The codes and models of feature distillation (FD) are released (ancientmooner)
  • d19503d — Change default value of WARMUP_PREFIX to True (zeliu98)
  • 22e57f4 — Support warmup_prefix for CosineLRScheduler (#278) (zeliu98)

🔒Security observations

The Swin Transformer repository demonstrates generally good security practices with proper licensing, security reporting guidelines in SECURITY.md, and open-source transparency. However, security concerns exist around missing dependency manifests for version pinning, potential unsafe YAML parsing in configuration files, custom kernel compilation risks, and data loading path validation. The codebase appears to be a research/ML project with appropriate governance, but lacks explicit security implementation details. No critical vulnerabilities were identified in the provided file structure, but comprehensive code review of Python modules is recommended to rule out injection risks and unsafe deserialization patterns.

  • Medium · Missing Dependency Pinning in Package Management — Root directory - missing dependency manifest. No package dependency file (requirements.txt, setup.py, or pyproject.toml) was provided in the analysis. This makes it impossible to verify if all dependencies are pinned to specific versions, which could lead to supply chain attacks through unexpected updates of transitive dependencies. Fix: Create and maintain a requirements.txt or pyproject.toml file with pinned versions for all direct and critical transitive dependencies. Use tools like pip-audit to regularly scan for known vulnerabilities.
  • Low · Potential Unsafe YAML Configuration Parsing — config.py, configs/ directory files. The repository contains multiple YAML configuration files in configs/ directory. If these are parsed using unsafe YAML loaders (e.g., yaml.load() instead of yaml.safe_load()), it could allow arbitrary code execution when loading malicious configuration files. Fix: Ensure all YAML parsing uses yaml.safe_load() instead of yaml.load(). Validate configuration file sources and restrict write permissions to config files.
  • Low · Custom CUDA Kernel Compilation — kernels/window_process/ directory. The repository includes C++ and CUDA kernel source files (swin_window_process.cpp, swin_window_process_kernel.cu) that are compiled as part of the build process. If these kernels are downloaded or updated from untrusted sources, or if the build environment is compromised, malicious code could be injected. Fix: Verify kernel source integrity using checksums or digital signatures. Document the kernel compilation process. Use reproducible builds where possible. Restrict build environment access.
  • Low · Potential Path Traversal in Data Loading — data/cached_image_folder.py, data/zipreader.py. Data loading modules (data/cached_image_folder.py, data/zipreader.py) process file paths from external sources. If not properly validated, this could lead to path traversal vulnerabilities allowing access to files outside intended directories. Fix: Implement strict path validation using os.path.abspath() and ensure all paths resolve within expected base directories. Use pathlib.Path for safer path operations.
  • Low · Unsafe Pickle Usage Potential — models/ directory (potential - requires code review). Machine learning projects often use pickle for model serialization. If the codebase deserializes pickle objects from untrusted sources without validation, it could lead to arbitrary code execution. Fix: If pickle is used for model loading, implement strict validation of model sources. Consider using safer serialization formats like SafeTensors or ONNX. Never unpickle untrusted data.
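A generic sketch of the path-containment check recommended in the data-loading finding above; this is not code from the repo, and the helper name is made up:

```python
# Resolve a user-supplied path and refuse anything that escapes the base directory.
from pathlib import Path

def resolve_inside(base_dir: str, user_path: str) -> Path:
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):      # Python 3.9+; compare parents on older versions
        raise ValueError(f"path escapes {base}: {user_path}")
    return candidate
```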

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
