Repository Management: LFS, Permissions & Recovery
Manage large files with Git LFS, scale repositories with Scalar, configure permissions and tags, and recover or purge data using Git commands.
Why Repository Management Matters
Think of a warehouse.
A small shop keeps everything on shelves — easy to find, quick to access. But when the shop grows into a massive warehouse, you need systems: large items go in special storage (LFS), access badges control who enters which area (permissions), labels on shelves help you find things (tags), and there’s a process for recovering dropped items or disposing of expired stock (recovery and purging).
Repository management is warehouse logistics for your code. As repositories grow in size, contributors, and history, you need strategies to keep them fast, secure, and organised.
Git Large File Storage (LFS)
Git LFS replaces large files in your repository with small pointer files while storing the actual file content on a separate LFS server. When you clone or checkout, Git LFS downloads only the large files you need for your current branch.
How Git LFS Works
Without LFS:
repo (5GB) = code (50MB) + full history of large files (4.95GB)
Every clone downloads 5GB
With LFS:
repo (50MB) = code (50MB) + pointer files (few KB)
LFS server stores actual large files
Clone downloads 50MB + only current version of needed large files
Setup:
- Install Git LFS: `git lfs install`
- Track file patterns: `git lfs track "*.psd"` (updates `.gitattributes`)
- Commit the `.gitattributes` file
- Add and commit large files normally — Git LFS intercepts and replaces them with pointers
What a pointer file looks like:
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345678
When to Use Git LFS
| File Type | Use LFS? | Why |
|---|---|---|
| PSD/AI design files | Yes | Large, binary, change frequently |
| Video/audio files | Yes | Large, binary |
| Compiled binaries (DLLs, JARs) | Yes | Binary, shouldn’t be in source anyway — consider packages instead |
| ML model files | Yes | Often 100MB+ |
| SQLite database files | Yes | Binary format, large |
| Images for documentation | Maybe | Small PNGs are fine in Git; large PSDs need LFS |
| Source code | No | Text files are what Git does best |
| Configuration files (JSON, YAML) | No | Small text files |
☁️ Jordan’s LFS Strategy
Jordan at Cloudstream Media manages a repo with video processing pipelines. The repo contains test video files (500MB each) for integration testing.
Jordan configures LFS tracking:
git lfs track "*.mp4"
git lfs track "*.mov"
git lfs track "*.psd"
git lfs track "models/*.bin"
Clone time dropped from 45 minutes to 3 minutes. Developers only download the video files they need for their current branch.
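Note that `git lfs track` only affects files added in future commits — the large files already baked into past commits stay in history. For a repo like Jordan's that grew before LFS was configured, `git lfs migrate` can rewrite existing history so matching files become pointers. A hedged sketch (requires git-lfs to be installed; the patterns are illustrative, and this rewrites history, so the usual force-push and re-clone rules apply):

```shell
# Rewrite ALL history so matching files become LFS pointers
git lfs migrate import --everything --include="*.mp4,*.mov,*.psd"

# Verify which files are now stored as LFS content
git lfs ls-files

# History was rewritten: force-push, then have collaborators re-clone
git push --force --all
git push --force --tags
```

Because this is a full history rewrite, treat it like any `filter-repo` operation: coordinate with the team before force-pushing.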
git-fat: A Lightweight Alternative
git-fat is a simpler alternative to Git LFS that stores large files in any rsync-accessible location (including S3, network drives, or cloud storage).
| Aspect | Git LFS | git-fat |
|---|---|---|
| Server requirement | Dedicated LFS server (GitHub, Azure Repos, GitLab include one) | Any rsync-accessible storage |
| Protocol | Custom LFS API over HTTP | rsync |
| Hosting support | GitHub, Azure Repos, GitLab, Bitbucket | Self-hosted storage only |
| Maintenance | Managed by hosting platform | Self-managed |
| Best for | Teams using hosted Git platforms | Teams needing custom storage backends |
Exam Tip: git-fat on the Exam
git-fat appears in the AZ-400 objectives but is rarely the correct answer. The exam typically tests whether you know it exists as an alternative to Git LFS for scenarios where you need custom storage backends. If the question mentions GitHub or Azure Repos, Git LFS is always the answer. git-fat is the answer only when the scenario requires self-managed storage or rsync-based transfer.
Scalar: Scaling Massive Repositories
Scalar is a tool from Microsoft (originally developed for the Windows OS repository — 300GB, 3.5 million files) that makes Git faster on large repositories without changing your workflow.
What Scalar does:
- Partial clone — clone without downloading all file contents (blobs downloaded on demand)
- Sparse checkout — only materialise the files and folders you need in your working directory
- Background maintenance — prefetch commits and run `git maintenance` automatically
- File system monitor — uses OS-level file watching instead of scanning all files for changes
- Commit graph — pre-computes commit relationships for faster log and blame operations
Setup:
scalar clone https://dev.azure.com/org/project/_git/huge-repo
Scalar wraps a normal git clone but enables all the optimisations automatically.
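The sparse checkout feature Scalar enables is also available in plain Git on any repository. A minimal self-contained sketch (the demo path and folder names `src`/`docs`/`media` are illustrative):

```shell
# Set up a demo repo with three top-level folders
rm -rf /tmp/sparse-demo && mkdir -p /tmp/sparse-demo && cd /tmp/sparse-demo
git init -q -b main
mkdir -p src docs media
echo 'app code'  > src/app.txt
echo 'guide'     > docs/guide.txt
echo 'test clip' > media/clip.txt
git add . && git -c user.email=demo@example.com -c user.name=demo commit -qm "initial"

# Materialise only src/ in the working directory (cone mode)
git sparse-checkout init --cone
git sparse-checkout set src

ls   # only src/ is present; docs/ and media/ stay out of the worktree
```

The full history and all blobs are still in `.git` here — sparse checkout only shrinks the working directory. Combined with partial clone (as Scalar does), the unneeded blobs are not downloaded at all.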
Cross-Repository Sharing
When multiple repositories need shared code, you have several options:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Git Submodules | Embeds one repo inside another as a pointer to a specific commit | Exact version pinning; independent repos | Complex update workflow; nested clone issues; confusing for beginners |
| Git Subtrees | Copies another repo's content into a subdirectory with merged history | Simpler than submodules; works with normal Git commands | History pollution; manual sync required |
| Package managers (NuGet, npm) | Publish shared code as a package; consume via dependency | Clean separation; semantic versioning; standard tooling | Requires package registry; more setup; release process needed |
| Monorepo | All code in one repository with build system managing projects | Single source of truth; atomic cross-project changes | Requires Scalar-level tooling at scale; long CI times without optimisation |
☁️ Jordan’s Recommendation
Jordan recommends package managers for most teams: “Submodules are a footgun for anyone who doesn’t live in the terminal. Publish shared libraries as packages — NuGet for .NET, npm for Node, PyPI for Python. Pin versions, test independently, update deliberately.”
For Cloudstream’s internal Bicep modules, Jordan uses Azure Container Registry as a Bicep module registry — each module is versioned and consumed by reference.
Repository Permissions
Azure Repos Permissions
Azure Repos uses a granular permission model at multiple levels:
| Level | Permissions Available |
|---|---|
| Organisation | Create repositories, manage repository policies |
| Project | Read, contribute, create branches, manage permissions |
| Repository | Read, contribute, create branch, create tag, manage notes, bypass policies, force push, edit policies |
| Branch | Per-branch permissions (contribute, force push, bypass policies) |
Key groups: Project Administrators, Contributors, Readers, Build Service (pipeline identity)
Important: The Build Service account needs explicit contribute permissions to push tags or update branches from pipelines.
GitHub Repository Roles
| Role | Capabilities |
|---|---|
| Read | View code, open issues, comment |
| Triage | Manage issues and PRs (label, assign, close) without code access |
| Write | Push code, manage branches, merge PRs |
| Maintain | Manage repository settings (except destructive actions) |
| Admin | Full access including settings, secrets, branch protection, delete |
GitHub Teams: Organise users into teams with role assignments. Teams can be nested (parent/child) for hierarchical access.
CODEOWNERS: Adds per-path reviewer requirements (covered in Module 5) — not technically permissions but functionally enforces who must review changes.
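A minimal CODEOWNERS sketch (paths and team names are illustrative; later rules take precedence over earlier ones for the same file):

```
# .github/CODEOWNERS
*              @org/maintainers
/infra/**      @org/platform-team
*.bicep        @org/platform-team
/docs/         @org/tech-writers
```

With branch protection requiring code-owner review, a PR touching `/infra/` cannot merge until someone from `@org/platform-team` approves.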
Exam Tip: Least Privilege Principle
The exam frequently tests the principle of least privilege. When asked which permission level to grant:
- Developers who push code: Write (GitHub) or Contributor (Azure Repos)
- CI/CD pipeline service accounts: Contributor with specific branch permissions
- QA team that only manages issues: Triage (GitHub) — a commonly missed role
- Project managers who view dashboards: Read (both platforms)
Never grant Admin when Write or Maintain would suffice. The exam penalises over-permissioning.
Tags: Organising the Repository
Tags mark specific commits as significant — typically releases.
Lightweight vs Annotated Tags
| Type | Command | What It Stores | Use When |
|---|---|---|---|
| Lightweight | git tag v1.0 | Just a pointer to a commit (like a branch that doesn’t move) | Quick, informal markers |
| Annotated | git tag -a v1.0 -m "Release 1.0" | Full Git object with tagger name, email, date, and message | Production releases — includes metadata for auditing |
Annotated tags are recommended for releases because they include:
- Who created the tag
- When it was created
- A message explaining the release
- Can be GPG-signed for verification
Tag Naming Conventions
- Semantic versioning: `v1.2.3` (major.minor.patch)
- Pre-release: `v2.0.0-beta.1`, `v2.0.0-rc.1`
- Date-based: `release-2026-04-15` (for teams without semver)
- Environment-based: avoid — tags should mark versions, not environments
Data Recovery with Git Commands
Git’s reflog is your safety net. It records every HEAD movement — even ones that don’t appear in git log.
git reflog
git reflog shows a log of where HEAD has pointed. Even if you force-push, reset, or rebase away commits, the reflog remembers.
git reflog
# a1b2c3d HEAD@{0}: reset: moving to HEAD~3
# e4f5a6b HEAD@{1}: commit: Add feature X
# c7d8e9f HEAD@{2}: commit: Fix bug Y
# Recover the lost commits:
git checkout e4f5a6b
# or
git cherry-pick e4f5a6b
# or
git reset --hard e4f5a6b
Important: The reflog is local only — it’s not pushed to remotes. Entries expire after 90 days for reachable refs and 30 days for unreachable refs (both configurable).
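Those expiry windows are controlled by two config keys; a sketch of extending them (the 180/60-day values and demo path are illustrative):

```shell
# Inside any repository: extend reflog retention
rm -rf /tmp/reflog-demo && mkdir -p /tmp/reflog-demo && cd /tmp/reflog-demo && git init -q
git config gc.reflogExpire "180.days"             # reachable entries (default: 90 days)
git config gc.reflogExpireUnreachable "60.days"   # unreachable entries (default: 30 days)
git config gc.reflogExpire                        # prints: 180.days
```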
Common Recovery Scenarios
| Scenario | Recovery Command |
|---|---|
| Accidentally reset to wrong commit | git reflog then git reset --hard HEAD@{N} |
| Deleted a branch with unmerged work | git reflog then git checkout -b recovered-branch COMMIT_SHA |
| Need a specific commit from another branch | git cherry-pick COMMIT_SHA |
| Reverted a merge and need to undo the revert | git revert REVERT_COMMIT_SHA (revert the revert) |
| Lost stashed changes | git stash list then git stash apply stash@{N} |
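The second scenario — a deleted branch with unmerged work — end to end, as a self-contained sketch (paths, branch names, and commit messages are illustrative):

```shell
# Demo: commit on a branch, delete it, then recover via the reflog
rm -rf /tmp/recover-demo && mkdir -p /tmp/recover-demo && cd /tmp/recover-demo
git init -q -b main
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "initial"
git switch -qc feature
echo 'unmerged work' > work.txt
git add . && git -c user.email=dev@example.com -c user.name=dev commit -qm "feature work"
git switch -q main
git branch -D feature                      # oops: the unmerged commit is now dangling

# The commit survives in the reflog; find its SHA and restore the branch
sha=$(git reflog --format='%H %gs' | grep 'feature work' | head -n1 | cut -d' ' -f1)
git branch recovered-feature "$sha"
git show recovered-feature:work.txt        # the work is back
```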
Removing Data from Source Control
Sometimes you need to permanently remove data — leaked credentials, accidentally committed large files, or sensitive data that should never have been pushed.
git filter-repo (Recommended)
git filter-repo is the modern, officially recommended tool for rewriting Git history. It replaced the older git filter-branch.
# Remove a specific file from all history
git filter-repo --path secrets.json --invert-paths
# Remove files larger than 10MB from all history
git filter-repo --strip-blobs-bigger-than 10M
# Replace text in all files across all history
git filter-repo --replace-text expressions.txt
BFG Repo-Cleaner
BFG is an older but still popular alternative — faster than git filter-branch but less flexible than git filter-repo.
# Remove files larger than 100MB from history
bfg --strip-blobs-bigger-than 100M
# Remove a specific file from all history
bfg --delete-files secrets.json
# Replace passwords in all history
bfg --replace-text passwords.txt
After rewriting history:
- Force-push to the remote: `git push --force --all`
- Force-push tags: `git push --force --tags`
- All collaborators must re-clone (their local history is now divergent)
- If credentials were leaked, rotate them immediately — rewriting history doesn’t revoke access
Exam Tip: Leaked Credentials
If the exam asks what to do when credentials are accidentally committed to a public repository:
- Rotate the credentials immediately — this is step one, before any history cleanup
- Remove the file from the working directory and commit
- Use git filter-repo or BFG to purge the file from all history
- Force-push to overwrite remote history
- Contact GitHub support to clear cached views (if public repo)
- Enable secret scanning to prevent future leaks
The key insight: rewriting history removes the file from Git but anyone who already cloned still has it. The credential must be rotated regardless.
1. Jordan's repository has grown to 8GB because developers committed large video test files directly (without LFS) over the past year. Clone times are unacceptable. What should Jordan do?
2. A developer accidentally committed an API key to a public GitHub repository 3 hours ago. Multiple people have already cloned the repository. What is the FIRST action to take?
3. Chen (SRE at Cloudstream) needs to mark a specific commit as the v3.0 production release with metadata including who approved it and a GPG signature. Which Git command should Chen use?
🎬 Video coming soon
Repository Management Deep Dive
Next up: Design and Implement Build and Release Pipelines — Domain 3 starts with testing strategies and pipeline fundamentals.