Repository Management: LFS, Permissions & Recovery
Manage large files with Git LFS, scale repositories with Scalar, configure permissions and tags, and recover or purge data using Git commands.
Why Repository Management Matters
Think of a warehouse.
A small shop keeps everything on shelves — easy to find, quick to access. But when the shop grows into a massive warehouse, you need systems: large items go in special storage (LFS), access badges control who enters which area (permissions), labels on shelves help you find things (tags), and there’s a process for recovering dropped items or disposing of expired stock (recovery and purging).
Repository management is warehouse logistics for your code. As repositories grow in size, contributors, and history, you need strategies to keep them fast, secure, and organised.
Git Large File Storage (LFS)
Git LFS replaces large files in your repository with small pointer files while storing the actual file content on a separate LFS server. When you clone or checkout, Git LFS downloads only the large files you need for your current branch.
How Git LFS Works
Without LFS:
repo (5GB) = code (50MB) + full history of large files (4.95GB)
Every clone downloads 5GB
With LFS:
repo (50MB) = code (50MB) + pointer files (few KB)
LFS server stores actual large files
Clone downloads 50MB + only current version of needed large files
Setup:
- Install Git LFS: `git lfs install`
- Track file patterns: `git lfs track "*.psd"` (updates `.gitattributes`)
- Commit the `.gitattributes` file
- Add and commit large files normally — Git LFS intercepts and replaces them with pointers
What a pointer file looks like:
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345678
When to Use Git LFS
| File Type | Use LFS? | Why |
|---|---|---|
| PSD/AI design files | Yes | Large, binary, change frequently |
| Video/audio files | Yes | Large, binary |
| Compiled binaries (DLLs, JARs) | Yes | Binary, shouldn’t be in source anyway — consider packages instead |
| ML model files | Yes | Often 100MB+ |
| SQLite database files | Yes | Binary format, large |
| Images for documentation | Maybe | Small PNGs are fine in Git; large PSDs need LFS |
| Source code | No | Text files are what Git does best |
| Configuration files (JSON, YAML) | No | Small text files |
☁️ Jordan’s LFS Strategy
Jordan at Cloudstream Media manages a repo with video processing pipelines. The repo contains test video files (500MB each) for integration testing.
Jordan configures LFS tracking:
git lfs track "*.mp4"
git lfs track "*.mov"
git lfs track "*.psd"
git lfs track "models/*.bin"
Clone time dropped from 45 minutes to 3 minutes. Developers only download the video files they need for their current branch.
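Note that `git lfs track` only affects files added in future commits — the large files already baked into past commits stay in history. For a repo like Jordan's that grew before LFS was configured, `git lfs migrate` can rewrite existing history so matching files become pointers. A hedged sketch (requires git-lfs to be installed; the patterns are illustrative, and this rewrites history, so the usual force-push and re-clone rules apply):

```shell
# Rewrite ALL history so matching files become LFS pointers
git lfs migrate import --everything --include="*.mp4,*.mov,*.psd"

# Verify which files are now stored as LFS content
git lfs ls-files

# History was rewritten: force-push, then have collaborators re-clone
git push --force --all
git push --force --tags
```

Because this is a full history rewrite, treat it like any `filter-repo` operation: coordinate with the team before force-pushing.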
git-fat: A Lightweight Alternative
git-fat is a simpler alternative to Git LFS that stores large files in any rsync-accessible location (including S3, network drives, or cloud storage).
| Aspect | Git LFS | git-fat |
|---|---|---|
| Server requirement | Dedicated LFS server (GitHub, Azure Repos, GitLab include one) | Any rsync-accessible storage |
| Protocol | Custom LFS API over HTTP | rsync |
| Hosting support | GitHub, Azure Repos, GitLab, Bitbucket | Self-hosted storage only |
| Maintenance | Managed by hosting platform | Self-managed |
| Best for | Teams using hosted Git platforms | Teams needing custom storage backends |
Exam Tip: git-fat on the Exam
git-fat appears in the AZ-400 objectives but is rarely the correct answer. The exam typically tests whether you know it exists as an alternative to Git LFS for scenarios where you need custom storage backends. If the question mentions GitHub or Azure Repos, Git LFS is always the answer. git-fat is the answer only when the scenario requires self-managed storage or rsync-based transfer.
Scalar: Scaling Massive Repositories
Scalar is a tool from Microsoft (originally developed for the Windows OS repository — 300GB, 3.5 million files) that makes Git faster on large repositories without changing your workflow.
What Scalar does:
- Partial clone — clone without downloading all file contents (blobs downloaded on demand)
- Sparse checkout — only materialise the files and folders you need in your working directory
- Background maintenance — prefetch commits and run `git maintenance` automatically
- File system monitor — uses OS-level file watching instead of scanning all files for changes
- Commit graph — pre-computes commit relationships for faster log and blame operations
Setup:
scalar clone https://dev.azure.com/org/project/_git/huge-repo
Scalar wraps a normal git clone but enables all the optimisations automatically.
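The sparse checkout feature Scalar enables is also available in plain Git on any repository. A minimal self-contained sketch (the demo path and folder names `src`/`docs`/`media` are illustrative):

```shell
# Set up a demo repo with three top-level folders
rm -rf /tmp/sparse-demo && mkdir -p /tmp/sparse-demo && cd /tmp/sparse-demo
git init -q -b main
mkdir -p src docs media
echo 'app code'  > src/app.txt
echo 'guide'     > docs/guide.txt
echo 'test clip' > media/clip.txt
git add . && git -c user.email=demo@example.com -c user.name=demo commit -qm "initial"

# Materialise only src/ in the working directory (cone mode)
git sparse-checkout init --cone
git sparse-checkout set src

ls   # only src/ is present; docs/ and media/ stay out of the worktree
```

The full history and all blobs are still in `.git` here — sparse checkout only shrinks the working directory. Combined with partial clone (as Scalar does), the unneeded blobs are not downloaded at all.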
Cross-Repository Sharing
When multiple repositories need shared code, you have several options:
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Git Submodules | Embeds one repo inside another as a pointer to a specific commit | Exact version pinning; independent repos | Complex update workflow; nested clone issues; confusing for beginners |
| Git Subtrees | Copies another repo's content into a subdirectory with merged history | Simpler than submodules; works with normal Git commands | History pollution; manual sync required |
| Package managers (NuGet, npm) | Publish shared code as a package; consume via dependency | Clean separation; semantic versioning; standard tooling | Requires package registry; more setup; release process needed |
| Monorepo | All code in one repository with build system managing projects | Single source of truth; atomic cross-project changes | Requires Scalar-level tooling at scale; long CI times without optimisation |
☁️ Jordan’s Recommendation
Jordan recommends package managers for most teams: “Submodules are a footgun for anyone who doesn’t live in the terminal. Publish shared libraries as packages — NuGet for .NET, npm for Node, PyPI for Python. Pin versions, test independently, update deliberately.”
For Cloudstream’s internal Bicep modules, Jordan uses Azure Container Registry as a Bicep module registry — each module is versioned and consumed by reference.
Repository Permissions
Azure Repos Permissions
Azure Repos uses a granular permission model at multiple levels:
| Level | Permissions Available |
|---|---|
| Organisation | Create repositories, manage repository policies |
| Project | Read, contribute, create branches, manage permissions |
| Repository | Read, contribute, create branch, create tag, manage notes, bypass policies, force push, edit policies |
| Branch | Per-branch permissions (contribute, force push, bypass policies) |
Key groups: Project Administrators, Contributors, Readers, Build Service (pipeline identity)
Important: The Build Service account needs explicit contribute permissions to push tags or update branches from pipelines.
GitHub Repository Roles
| Role | Capabilities |
|---|---|
| Read | View code, open issues, comment |
| Triage | Manage issues and PRs (label, assign, close) without code access |
| Write | Push code, manage branches, merge PRs |
| Maintain | Manage repository settings (except destructive actions) |
| Admin | Full access including settings, secrets, branch protection, delete |
GitHub Teams: Organise users into teams with role assignments. Teams can be nested (parent/child) for hierarchical access.
CODEOWNERS: Adds per-path reviewer requirements (covered in Module 5) — not technically permissions but functionally enforces who must review changes.
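A minimal CODEOWNERS sketch (paths and team names are illustrative; later rules take precedence over earlier ones for the same file):

```
# .github/CODEOWNERS
*              @org/maintainers
/infra/**      @org/platform-team
*.bicep        @org/platform-team
/docs/         @org/tech-writers
```

With branch protection requiring code-owner review, a PR touching `/infra/` cannot merge until someone from `@org/platform-team` approves.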
Exam Tip: Least Privilege Principle
The exam frequently tests the principle of least privilege. When asked which permission level to grant:
- Developers who push code: Write (GitHub) or Contributor (Azure Repos)
- CI/CD pipeline service accounts: Contributor with specific branch permissions
- QA team that only manages issues: Triage (GitHub) — a commonly missed role
- Project managers who view dashboards: Read (both platforms)
Never grant Admin when Write or Maintain would suffice. The exam penalises over-permissioning.
Tags: Organising the Repository
Tags mark specific commits as significant — typically releases.
Lightweight vs Annotated Tags
| Type | Command | What It Stores | Use When |
|---|---|---|---|
| Lightweight | git tag v1.0 | Just a pointer to a commit (like a branch that doesn’t move) | Quick, informal markers |
| Annotated | git tag -a v1.0 -m "Release 1.0" | Full Git object with tagger name, email, date, and message | Production releases — includes metadata for auditing |
Annotated tags are recommended for releases because they include:
- Who created the tag
- When it was created
- A message explaining the release
- Can be GPG-signed for verification
Tag Naming Conventions
- Semantic versioning: `v1.2.3` (major.minor.patch)
- Pre-release: `v2.0.0-beta.1`, `v2.0.0-rc.1`
- Date-based: `release-2026-04-15` (for teams without semver)
- Environment-based: avoid — tags should mark versions, not environments
Data Recovery with Git Commands
Git’s reflog is your safety net. It records every HEAD movement — even ones that don’t appear in git log.
git reflog
git reflog shows a log of where HEAD has pointed. Even if you force-push, reset, or rebase away commits, the reflog remembers.
git reflog
# a1b2c3d HEAD@{0}: reset: moving to HEAD~3
# e4f5a6b HEAD@{1}: commit: Add feature X
# c7d8e9f HEAD@{2}: commit: Fix bug Y
# Recover the lost commits:
git checkout e4f5a6b
# or
git cherry-pick e4f5a6b
# or
git reset --hard e4f5a6b
Important: The reflog is local only — it’s not pushed to remotes. Entries expire after 90 days for reachable refs and 30 days for unreachable refs (both configurable).
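Those expiry windows are controlled by two config keys; a sketch of extending them (the 180/60-day values and demo path are illustrative):

```shell
# Inside any repository: extend reflog retention
rm -rf /tmp/reflog-demo && mkdir -p /tmp/reflog-demo && cd /tmp/reflog-demo && git init -q
git config gc.reflogExpire "180.days"             # reachable entries (default: 90 days)
git config gc.reflogExpireUnreachable "60.days"   # unreachable entries (default: 30 days)
git config gc.reflogExpire                        # prints: 180.days
```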
Common Recovery Scenarios
| Scenario | Recovery Command |
|---|---|
| Accidentally reset to wrong commit | git reflog then git reset --hard HEAD@{N} |
| Deleted a branch with unmerged work | git reflog then git checkout -b recovered-branch COMMIT_SHA |
| Need a specific commit from another branch | git cherry-pick COMMIT_SHA |
| Reverted a merge and need to undo the revert | git revert REVERT_COMMIT_SHA (revert the revert) |
| Lost stashed changes | git stash list then git stash apply stash@{N} |
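The second scenario — a deleted branch with unmerged work — end to end, as a self-contained sketch (paths, branch names, and commit messages are illustrative):

```shell
# Demo: commit on a branch, delete it, then recover via the reflog
rm -rf /tmp/recover-demo && mkdir -p /tmp/recover-demo && cd /tmp/recover-demo
git init -q -b main
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "initial"
git switch -qc feature
echo 'unmerged work' > work.txt
git add . && git -c user.email=dev@example.com -c user.name=dev commit -qm "feature work"
git switch -q main
git branch -D feature                      # oops: the unmerged commit is now dangling

# The commit survives in the reflog; find its SHA and restore the branch
sha=$(git reflog --format='%H %gs' | grep 'feature work' | head -n1 | cut -d' ' -f1)
git branch recovered-feature "$sha"
git show recovered-feature:work.txt        # the work is back
```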
Removing Data from Source Control
Sometimes you need to permanently remove data — leaked credentials, accidentally committed large files, or sensitive data that should never have been pushed.
git filter-repo (Recommended)
git filter-repo is the modern, officially recommended tool for rewriting Git history. It replaced the older git filter-branch.
# Remove a specific file from all history
git filter-repo --path secrets.json --invert-paths
# Remove files larger than 10MB from all history
git filter-repo --strip-blobs-bigger-than 10M
# Replace text in all files across all history
git filter-repo --replace-text expressions.txt
BFG Repo-Cleaner
BFG is an older but still popular alternative — faster than git filter-branch but less flexible than git filter-repo.
# Remove files larger than 100MB from history
bfg --strip-blobs-bigger-than 100M
# Remove a specific file from all history
bfg --delete-files secrets.json
# Replace passwords in all history
bfg --replace-text passwords.txt
After rewriting history:
- Force-push to the remote: `git push --force --all`
- Force-push tags: `git push --force --tags`
- All collaborators must re-clone (their local history is now divergent)
- If credentials were leaked, rotate them immediately — rewriting history doesn’t revoke access
Exam Tip: Leaked Credentials
If the exam asks what to do when credentials are accidentally committed to a public repository:
- Rotate the credentials immediately — this is step one, before any history cleanup
- Remove the file from the working directory and commit
- Use git filter-repo or BFG to purge the file from all history
- Force-push to overwrite remote history
- Contact GitHub support to clear cached views (if public repo)
- Enable secret scanning to prevent future leaks
The key insight: rewriting history removes the file from Git but anyone who already cloned still has it. The credential must be rotated regardless.
1. Jordan's repository has grown to 8GB because developers committed large video test files directly (without LFS) over the past year. Clone times are unacceptable. What should Jordan do?
2. A developer accidentally committed an API key to a public GitHub repository 3 hours ago. Multiple people have already cloned the repository. What is the FIRST action to take?
3. Chen (SRE at Cloudstream) needs to mark a specific commit as the v3.0 production release with metadata including who approved it and a GPG signature. Which Git command should Chen use?
🎬 Video coming soon
Repository Management Deep Dive
Next up: Design and Implement Build and Release Pipelines — Domain 3 starts with testing strategies and pipeline fundamentals.