Git Working
1) Internals of Git
1.1) General types of Git commands
1. High level user commands called "Porcelain" Commands, such as
Basic - git add, git commit
For Remote repos - git push, git pull
To work with Branches - git branch, git checkout, git merge, git rebase
…
2. Low level commands called "Plumbing" Commands, such as
git cat-file, git hash-object, git count-objects, …
- These are the basic building blocks that Porcelain commands are built on.
- Mostly needed only for advanced Git scripting; otherwise rarely required.
- These commands can be hard to understand and some may be confusing, but they are useful for understanding the model of Git, rather than for learning the low level commands themselves.
1.2) Git is a Distributed Revision Control System
= Distributed + a Revision Control System
= Distributed + Branch/History of file + a Stupid Content Tracker for file or directories
= Distributed + Branch/History of file + Tracking files, Notion of commit or version + at core it is a Persistent Map (i.e. a structure that maps keys-values which is persistent, stored on disk)
Git = a Map --> a Persistent Map ---> a Content Tracker (with Versioning)
1.2.1) a Map
#) It has a table with keys & values
- values are sequences of bytes (e.g. from the content of a text file or binary file)
- key is a hash calculated from the value, using the SHA1 algorithm. Every content has exactly one SHA1-hash. It is 20 bytes long (shown as 40 hexadecimal digits).
Low Level Command : git hash-object [-w] <file>   OR   git hash-object [-w] --stdin
‘hash-object’ generates the hash of a file's content. ‘--stdin’ takes the input from standard input (e.g. piped from echo) instead of from a file. A directory can't be hashed this way; directories are stored as tree objects.
options : -w, to make content persistent by writing it in git repository (i.e. Store in object database)
e.g. echo "Hello world!" | git hash-object --stdin
e.g. git hash-object file_path
#) Every object in Git repository has its own SHA1-hash.
#) The number of possible SHA1-hashes is large (2^160), but not infinite. Even so, it is extremely unlikely that 2 different contents get the same SHA1-hash by chance (a collision).
#) For practical purposes SHA1-hashes are unique, not only within a project but across all projects. A very large repository may run into performance problems, but still not into accidental collisions.
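The key calculation above can be reproduced outside Git. A minimal sketch in Python, assuming the usual blob layout (a header "blob <size>" plus a NUL byte, followed by the raw content); the function name git_blob_hash is only illustrative:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA1 key Git would assign to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a newline, so the equivalent of
#   echo "Hello world!" | git hash-object --stdin
# hashes the bytes b"Hello world!\n".
key = git_blob_hash(b"Hello world!\n")
print(key, len(key))  # 40 hex digits = 20 bytes
```

The same content always yields the same key, which is what makes the key usable as a map index.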
1.2.2) a Persistent Map
#) Besides Generating SHA1-hash, Git also Saves the content in repository. [ 'git init' can be used for creating (empty) repository and then storing/committing content in repo.]
e.g. echo "Hello world!" | git hash-object --stdin -w :
- This stores in object database (.git/objects/),
- By creating a sub-directory named after the first 2 hex digits (1st byte) of the SHA1-hash, and a file in that sub-directory named after the remaining 38 hex digits
- This is a blob of data. A blob is a generic piece of content.
- The original content is not stored as-is: Git prepends a small header and compresses the result. So the file can't be opened and read directly; 'git cat-file' can be used for that.
#) The Persistent Map is the very basis of the Git model
= Any content --> Generate key (SHA1-hash) for it --> Persist the content into repository as a blob
Low Level Command : git cat-file (-t | -p) <SHA1-hash OR Tag-name>
options : -t, to print ONLY the type of the object (like blob)
-p, to print ONLY the actual content of the object (it Unzips the object, Removes the header and Prints the content)
NOTE: The first few digits of the SHA1-hash are enough to identify an object, unless several hashes in the database start with the same digits.
High Level Command : git init
to create git project i.e. to initialize a git repository
It does so by creating the '.git' sub-directory, which is hidden and non-empty, but whose 'objects' database is still empty (apart from the ‘info’ & ‘pack’ sub-directories). Commits and writes add to the object database.
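The persistent map can be sketched as two small functions: one that writes an object the way a loose object is laid out on disk, and one that reads it back (roughly what 'git cat-file -t' and '-p' do). This is a simplified model, not Git's implementation; the scratch directory stands in for .git/objects and the function names are hypothetical:

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Store content like a loose Git object and return its key."""
    raw = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    key = hashlib.sha1(raw).hexdigest()
    # Sub-directory = first 2 hex digits, file name = remaining 38.
    path = objects_dir / key[:2] / key[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))  # header + content, compressed
    return key


def read_object(objects_dir: Path, key: str):
    """Unzip the object, split off the header, return (type, content)."""
    raw = zlib.decompress((objects_dir / key[:2] / key[2:]).read_bytes())
    header, _, content = raw.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content


# Scratch directory used as a stand-in for .git/objects (an assumption).
objects_dir = Path(tempfile.mkdtemp()) / "objects"
key = write_object(objects_dir, "blob", b"Hello world!\n")
obj_type, content = read_object(objects_dir, key)
print(key[:2], key[2:], obj_type, content)
```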
1.2.3) a Content Tracker
A commit (saving in the repo) creates a bunch of sub-directories and files in the object database (.git/objects/).
A commit is a simple text file, file-commit-details, which is stored in the object database the same way as a blob (Header + Compressed content).
The tree in a commit points to the root directory of the project, file-commit-treeroot, which is also stored in the object database the same way as a blob.
→ ONE file-commit-details Header: commit. Compressed content/metadata: hash of file-commit-treeroot, author & committer name+email+date+time, commit message.
→ file-commit-treeroot Header: tree. Compressed content/metadata: list of hashes for the content (directories & files) of the root directory, each with access permission & name.
[Through nested trees, the object database holds the hashes of all sub-directories and files, recursively]
→ file-tree-subdir Header: tree. Compressed content/metadata: list of hashes for the content (directories & files) in it, each with access permission & name.
→ file-blob Header: blob. Compressed content: actual file content
NOTE: Git doesn’t create separate objects (blobs) for files having the same content (i.e. same hash). It reuses the existing object which is already in the database. The same applies to other categories of objects (for example directories).
It is easiest to think of commits, blobs and trees as just separate files, which are hashed and stored in the database.
*** Git is efficient as it doesn’t store the same content more than once.
NOTE: Though Git stores a new blob every time a file changes, a few more optimizations are performed to save space (for example for a huge file with a single-line change).
As an optimization, Git may sometimes store only the difference between 2 files, or pack multiple objects into the same physical file. The ‘info’ & ‘pack’ directories in the object database serve this purpose.
NOTE: Names of blobs and trees are not stored in the object itself; they are stored in the parent tree. So the same object (i.e. blob/tree) can be pointed to by different trees under different names.
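The commit → tree → blob chain and the reuse of identical content can be sketched with an in-memory dictionary in place of .git/objects. The author identity, file names and helper names below are placeholders, and the entry sorting is simplified (real Git orders directory entries slightly differently):

```python
import hashlib

store = {}  # key -> raw object bytes; in-memory stand-in for .git/objects


def put(obj_type: str, content: bytes) -> str:
    raw = obj_type.encode() + b" " + str(len(content)).encode() + b"\0" + content
    key = hashlib.sha1(raw).hexdigest()
    store[key] = raw  # storing the same content twice keeps a single entry
    return key


def put_tree(entries) -> str:
    """entries: (mode, name, key) tuples. Names live here, not in the blobs."""
    body = b""
    for mode, name, key in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(key)
    return put("tree", body)


blob = put("blob", b"same content\n")
same_blob = put("blob", b"same content\n")  # reuses the existing object

# Two different names pointing to one blob.
root = put_tree([("100644", "a.txt", blob), ("100644", "copy_of_a.txt", same_blob)])

commit_body = (
    f"tree {root}\n"
    "author A U Thor <author@example.com> 0 +0000\n"
    "committer A U Thor <author@example.com> 0 +0000\n"
    "\n"
    "first commit\n"
).encode()
commit = put("commit", commit_body)

print(len(store))  # one blob, one tree, one commit
```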
High Level Command : git status
to see the state of files & folders in the project.
It lists files/folders which are
Untracked (new, never added; so Git doesn’t track them yet),
Modified but not staged (tracked and changed, but not added since the change), and
in the Staging area (added, but not committed i.e. Changes to be committed)
High Level Command : git add [files/folder]
to add new or changed files & folders to the Staging area
High Level Command : git commit [-m commit_message]
to commit the staged changes, after which nothing is left to be committed
High Level Command : git log
to list the commits reachable from the current commit (HEAD), newest first
1.2.3.1) Versioning in Git : Commits (i.e. file-commit-details) are linked to each other, except the 1st commit, which has no parent.
→ (Except 1st) file-commit-details Header: commit. Compressed content: additionally a parent link with the hash of the previous commit (previous file-commit-details)
→ (Another) file-commit-treeroot reflecting the updates
Low Level Command : git count-objects
to display the number of loose (unpacked) objects in the git database and the disk space they use (in kilobytes).
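The parent links between commits can be sketched as follows. This is a simplified model: author/committer lines are left out, the tree keys are placeholders, and make_commit / log are hypothetical helper names:

```python
import hashlib

commits = {}  # key -> {"tree": ..., "parent": ..., "message": ...}


def make_commit(tree: str, parent, message: str) -> str:
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")  # the 1st commit has no parent line
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    key = hashlib.sha1(b"commit " + str(len(body)).encode() + b"\0" + body).hexdigest()
    commits[key] = {"tree": tree, "parent": parent, "message": message}
    return key


def log(start: str):
    """Walk the parent links, newest commit first."""
    key = start
    while key is not None:
        yield commits[key]["message"]
        key = commits[key]["parent"]


tree_v1, tree_v2 = "1" * 40, "2" * 40  # placeholder tree keys
first = make_commit(tree_v1, None, "first")
second = make_commit(tree_v2, first, "second")
print(list(log(second)))
```

Because the parent hash is part of the hashed body, changing any earlier commit would change the key of every commit after it.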
1.2.3.2) Git content management is simple
- References between commits are used to track history. All other references are used to track content.
- As Git reuses content, there can be objects which are reachable from more than one commit.
- During checkout, Git only looks up the trees and blobs of the checked-out commit in the object database and replaces the working directory content with them. It doesn’t care about history and doesn’t look at how commits are connected to each other. The same applies to a commit created by a merge (merge-commit).
- The working area is the least important part of a project; all important data is stored in the git database within the ‘.git’ directory
1.2.3.3) Git Tag :
- An (Annotated) Tag is one of the 4 types of objects in database, together with commits, trees and blobs
- It is a Git object which is a simple LABEL for a state of the project, attached to an existing object
- 2 types of Tags: Annotated Tag with message and Regular Tag ( / Non-annotated / Lightweight)
- For both types a tag reference file gets created in the ‘.git/refs/tags/’ directory; only an Annotated Tag additionally creates a tag object in the object database
- Regular / Non-annotated / Lightweight Tag
- It is a simple label without detailed information
- Reference to an object (can be to a commit, just like a branch)
- A very simple file, containing the SHA1 of the (commit) object in database
- Annotated Tag
- Like a Regular Tag, but the file contains the SHA1 of a tag object in the database; the tag object in turn references a commit and contains all the extra information
- Its creation is like creating a commit; it is also an object like a commit object
- Metadata: object & type (e.g. commit) to which the tag is pointing, tag-name, tagger name+email+date+time, tag-message
- Tags look just like Branches = A Tag is like a Branch that doesn’t move
A newly created commit moves the CURRENT branch along, to track the changes. But a Tag sticks to the same object forever.
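High Level Command: git tag [-a] [-m <tag-message>] [<name-of-tag>]
without arguments, to list the existing tag names; with a name, to create a tag on the current commit
options: -a, to create an Annotated Tag with the specified name. Without it, a Regular Tag is created
-m, to provide the tag message (giving -m alone also implies -a)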
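The difference between the two tag types can be sketched with two dictionaries standing in for the object database and for the files under .git/refs/tags/. The tag names, tagger identity and helper names are placeholders:

```python
import hashlib

objects = {}  # in-memory stand-in for the object database
refs = {}     # stand-in for the files under .git/refs/tags/


def put(obj_type: str, body: bytes) -> str:
    raw = obj_type.encode() + b" " + str(len(body)).encode() + b"\0" + body
    key = hashlib.sha1(raw).hexdigest()
    objects[key] = (obj_type, body)
    return key


def resolve_tag(name: str) -> str:
    """Follow a tag reference to the commit it ultimately labels."""
    key = refs["refs/tags/" + name]
    obj_type, body = objects[key]
    if obj_type == "tag":
        # The first line of a tag object is "object <key>".
        key = body.decode().splitlines()[0].split(" ")[1]
    return key


commit = put("commit", b"tree " + b"0" * 40 + b"\n\nplaceholder commit\n")

# Lightweight tag: the ref holds the commit key directly.
refs["refs/tags/light"] = commit

# Annotated tag: the ref holds the key of a tag object, which points to the commit.
tag_body = (
    f"object {commit}\ntype commit\ntag v1.0\n"
    "tagger A U Thor <author@example.com> 0 +0000\n\nfirst release\n"
).encode()
refs["refs/tags/v1.0"] = put("tag", tag_body)

print(resolve_tag("light") == resolve_tag("v1.0"))
```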
1.2.4) a Revision Control System
Branches, Merges, Rebases and Tags are the main features that turn Git from a Content Tracker into a Revision Control System
1.3) Git Objects in Database :
- Blobs: arbitrary content, which holds the data
- Trees: equivalents of directories, which contain blobs & other trees (recursive)
- Commits
- (Annotated) Tags
Git is a high-level file system (Version file system) over native file system, with
# Files/content as Blobs,
# Directories/nested containers as Trees, and
# Links, as the same file or directory can be reached from different places under different names
2) Git Branches
2.1) a Content Tracker with multiple branches
- Git creates the default branch, i.e. master branch, only with the first commit. No user specified branch can be created before that.
- A branch is a simple reference / pointer to a commit : Each branch stores the hash of its last commit in an uncompressed file named after the branch in .git/refs/heads/ (say branchfile). Branch = reference / pointer to a commit
- That branchfile can be read directly, using the ‘cat’ command. One can even delete or rename a branch by deleting or renaming the branchfile, or create a new branch by writing a new branchfile with the hash of any commit into that folder. NOTE Git provides high level commands for these operations, which should be preferred.
- A newly created branch points to exactly the same commit as the CURRENT branch (i.e. exactly same hash / commit object). So both branches (CURRENT branch & new branch) point to the same commit right after the new branch is created.
High Level Command : git branch [branchname] [--all]
to list Local branches. Current branch is marked with *
branchname: to JUST create a new branch with the provided name, without switching to that branch
options: --all, to list all branches including Remote ones and the current position of HEAD
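Since a branch is just a small file holding a commit hash, creating and renaming branches can be sketched as plain file operations. The scratch directory below only mimics the .git/refs/heads/ layout, and the commit hash is a placeholder; in a real repository the high level commands should be used, since refs may also live in .git/packed-refs:

```python
import tempfile
from pathlib import Path

# Scratch directory mimicking the layout of a repository (an assumption).
git_dir = Path(tempfile.mkdtemp()) / ".git"
heads = git_dir / "refs" / "heads"
heads.mkdir(parents=True)

commit = "a" * 40  # placeholder commit hash


def create_branch(name: str, commit_hash: str) -> None:
    """A branch is a file whose content is the hash of its last commit."""
    (heads / name).write_text(commit_hash + "\n")


def read_branch(name: str) -> str:
    return (heads / name).read_text().strip()


create_branch("master", commit)
create_branch("feature", read_branch("master"))  # new branch = same commit
(heads / "feature").rename(heads / "topic")      # renaming the file renames the branch

branches = sorted(p.name for p in heads.iterdir())
print(branches, read_branch("topic") == read_branch("master"))
```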
2.1.1) Current Branch
- .git/HEAD contains a reference to a file (i.e. filepath, e.g. ‘ref: refs/heads/master’), which is the CURRENT branchfile. It is a kind of pointer to pointer (branchfile pointing to a commit, while HEAD points to the branchfile).
- HEAD is a reference to the CURRENT branch, or directly to a commit (in case of Detached HEAD) : There is only one HEAD file in a git repo, so a repo can have only one CURRENT branch.
2.1.2) Internal steps during commit to branch
- Git creates & adds a few new objects to the object database for the new commit, including the commit object itself with the previous commit as its parent.
- Then it looks in the HEAD file to find the CURRENT branch i.e. branchfile, and moves that branch to point to the new commit; other branches / branchfiles are not touched.
So the CURRENT branch moves while the content of HEAD doesn’t change; in effect HEAD moves along with the CURRENT branch (as it always points to the CURRENT branchfile).
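The pointer-to-pointer behaviour during a commit can be sketched with a small dictionary model. The commit ids "c1"/"c2" are placeholders and the function names are hypothetical; object creation itself is left out:

```python
# refs: branchfile path -> commit id; state["HEAD"]: content of .git/HEAD
refs = {"refs/heads/master": "c1", "refs/heads/feature": "c1"}
state = {"HEAD": "ref: refs/heads/master"}


def current_commit() -> str:
    head = state["HEAD"]
    if head.startswith("ref: "):
        return refs[head[len("ref: "):]]  # follow the pointer to the branchfile
    return head                           # detached HEAD stores a commit id directly


def commit(new_id: str) -> None:
    """Record a new commit id (creating the commit object is not modelled)."""
    head = state["HEAD"]
    if head.startswith("ref: "):
        refs[head[len("ref: "):]] = new_id  # only the CURRENT branch moves
    else:
        state["HEAD"] = new_id              # detached: HEAD itself moves


head_before = state["HEAD"]
commit("c2")
print(state["HEAD"], refs, current_commit())
```

The content of HEAD is unchanged after the commit, yet following it now leads to the new commit.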
High Level Command : git checkout [-b] <other-BranchName or some-CommitSHA1>
to switch to another existing branch/commit, i.e. to point HEAD to the other branch/commit
options : -b, to create a not yet existing branch and then switch to it
#) Detached HEAD = checkout of a commit instead of a branch
- HEAD is not pointing to any branch; it points directly to a commit
- So there is no current branch (HEAD is detached: not on any branch, but on a commit)
#) Commits made on a Detached HEAD = candidates for Garbage Collection
- Occurs when commits were made on a Detached HEAD and HEAD is then switched to another branch
- Those commits can then no longer be reached by any reference (i.e. branch, HEAD or tag); so they eventually get removed to save disk space.
- They can be saved by creating a branch on them, using ‘git branch [branch-name]’, before switching away
- Detached HEAD → do experiment → commit experiment as required → decide to keep experiment in a branch or ignore it
2.1.3) Internal steps during branch checkout = move HEAD and update working area
- git changes HEAD to point to other branchfile
- git replaces files & folders in working area / directory with files & folders in commit pointed by checkout branch (i.e. from last commit of checkout branch)
High Level Command : git merge <branchname-to-merge-from>
to merge changes of merge-from-branch in CURRENT branch
(to have changes from both branch together)
2.1.4) Merging of Branches
2.1.4.1) Internal steps during merging of branches , when no conflicts
- ‘git merge’ automatically merges the changes from the specified branch into the CURRENT branch
- It also creates a new commit automatically, offering to edit the default merge log message if none is specified.
- A merge-commit object is the same as any other commit object, except that it has 2 parents (the last commits of both merged branches), whereas a normal commit has only one. [NOTE: in general a commit object can have any number of parents]
- Moves the CURRENT branch pointer to the new commit, and so HEAD too.
2.1.4.2) Internal steps during merging of branches , on conflicts
- When at least some of the changes from both branches conflict, the automatic merge fails and shows the names of the conflicting files. The conflicts need to be resolved manually.
- ‘git status’ lists the conflicting files as Unmerged. Each conflicting file shows the overlapping / divergent changes from both branches, which need to be resolved manually.
- After the conflicts are resolved, ‘git add’ needs to be run explicitly for the conflicting files, and then ‘git commit’ to record that the conflicts have been fixed.
- During all these steps, Git knows it is in a middle-of-merge state
2.1.4.3) Merging a recently merged branch the other way round = Fast-forward
- All required objects are already present in the latest merge-commit; so Git reuses that merge-commit itself, as it has the latest version of all content from both branches.
- Then it just moves the CURRENT branch to point to the same merge-commit as the other branch; and so HEAD too.
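Whether a merge can be a Fast-forward comes down to an ancestry check on the commit graph. A minimal sketch with placeholder commit ids ("M" is the merge-commit of "B" and "C"); the function names and return labels are illustrative, not Git output:

```python
parents = {  # commit id -> list of parent ids (placeholder history)
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "M": ["B", "C"],  # merge-commit with 2 parents
}


def is_ancestor(ancestor: str, descendant: str) -> bool:
    """True if 'ancestor' can be reached from 'descendant' via parent links."""
    stack, seen = [descendant], set()
    while stack:
        c = stack.pop()
        if c == ancestor:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return False


def merge(branches: dict, current: str, other: str) -> str:
    ours, theirs = branches[current], branches[other]
    if is_ancestor(theirs, ours):
        return "up-to-date"
    if is_ancestor(ours, theirs):
        branches[current] = theirs  # no new commit: just move the pointer
        return "fast-forward"
    return "merge-commit-needed"


# master already contains the merge; feature still sits on C.
branches = {"master": "M", "feature": "C"}
result = merge(branches, "feature", "master")
print(result, branches)
```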
2.1.5) Git Object Model
Git repo = bunch of Git objects linked to each other in a graph (Commits, Blobs, Trees or Tags)
Git Branches = references to commits
Git HEAD = only one of it; a reference to one of the Git branches, or directly to a commit when detached. A mark of the current position in the graph
Current branch tracks new commits (Git commit or Git merge moves the current branch to the new commit. In case of Detached HEAD, HEAD itself is moved to the new commit)
Working directory is updated automatically when HEAD moves to another commit
Any object (commit/blob/tree) that can’t be reached from a branch, HEAD or Tag is Garbage Collected
3) Git Rebase
- Branching & Merging are standard operations for any revision control system. Rebasing is much less common, and Git is well known for its Rebase operation.
- It is Git's signature feature
High Level Command : git rebase <branchname-to-base-as>
to set the base-as branch as the new base of the CURRENT branch
(so that all commits and the respective history of the base-as branch come before the commits of the CURRENT branch)
3.1) Rebase of Branches
- Like the merge operation, rebase puts all the commits from both branches into the same history
- Unlike merge, which keeps multiple branches side by side, rebase re-arranges the branches to look like a single branch
3.1.1) Internal steps during rebase of branches , when no conflicts
- Looks for the latest commit common to both branches (CURRENT and base-as-branch)
- Illusion: Detaches the CURRENT branch from that common commit / base
- Actual: Creates copies of the commits = new commit objects with mostly the same data but an updated parent; because of that they get a new SHA1, i.e. new files with new filenames in the database directory
- Illusion: Moves it on top of the base-as-branch; so the base of the CURRENT branch changes to the base-as-branch, with the respective update in history
- Actual: Git moves the branch to the new commits, leaving the earlier commits as they are
Only the branch that pointed to the earlier commits gets moved to the new commits. So the earlier commits become impossible to reach without knowing their SHA1.
If nothing else (e.g. another branch) points at the earlier commits, they get Garbage Collected.
So the Rebase operation creates new commits.
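The "Actual" steps of a rebase can be sketched on a simplified commit model: only parent and message are hashed, trees and author data are left out, history is assumed to be linear with a common base, and conflicts are not modelled. All names and messages are placeholders:

```python
import hashlib

commits = {}  # key -> {"parent": ..., "message": ...}


def make_commit(parent, message: str) -> str:
    header = f"parent {parent}\n" if parent else ""
    body = (header + "\n" + message + "\n").encode()
    key = hashlib.sha1(b"commit " + str(len(body)).encode() + b"\0" + body).hexdigest()
    commits[key] = {"parent": parent, "message": message}
    return key


base = make_commit(None, "base")
m1 = make_commit(base, "work on master")
f1 = make_commit(base, "feature step 1")
f2 = make_commit(f1, "feature step 2")
branches = {"master": m1, "feature": f2}


def rebase(branch: str, onto: str) -> None:
    # 1. Collect the history of the base-as branch to find the common commit.
    onto_history, k = set(), branches[onto]
    while k is not None:
        onto_history.add(k)
        k = commits[k]["parent"]
    # 2. Collect the commits of 'branch' that are not in that history.
    to_copy, k = [], branches[branch]
    while k not in onto_history:
        to_copy.append(k)
        k = commits[k]["parent"]
    # 3. Copy them, oldest first, on top of the base-as branch (new parent => new key).
    new_parent = branches[onto]
    for old in reversed(to_copy):
        new_parent = make_commit(new_parent, commits[old]["message"])
    # 4. Move only the branch pointer; the old commits are left as they are.
    branches[branch] = new_parent


rebase("feature", "master")
tip = branches["feature"]
print(tip != f2, commits[tip]["message"], f2 in commits)
```

The old commits f1 and f2 still exist in the store, but no branch points at them any more.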
3.1.1.1) Garbage Collection = Git Garbage-collects unreachable objects
- From time to time, e.g. after commands likely to leave unreachable commits behind (git checkout away from a Detached HEAD, git rebase), Git takes some time to go through the objects in the database, identify unreachable objects (commits/blobs/trees) and delete them.
- Git doesn’t waste space on commits which can’t be reached by any reference. They are considered dead and removed by the Garbage Collector.
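Identifying unreachable objects is a graph walk starting from all references. A minimal sketch with placeholder object ids; real Git additionally keeps recently orphaned objects for a grace period, which is not modelled here:

```python
objects = {  # object id -> ids it references (placeholder graph)
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blobA", "blobB"],
    "tree1": ["blobA"],
    "blobA": [],
    "blobB": [],
    "orphan_commit": ["orphan_tree"],  # e.g. left behind by a rebase
    "orphan_tree": ["blobA", "orphan_blob"],
    "orphan_blob": [],
}
refs = {"HEAD": "commit2", "refs/heads/master": "commit2", "refs/tags/v1": "commit1"}


def reachable(roots) -> set:
    """All objects that can be reached by following links from the roots."""
    seen, stack = set(), list(roots)
    while stack:
        k = stack.pop()
        if k not in seen:
            seen.add(k)
            stack.extend(objects[k])
    return seen


def collect_garbage() -> list:
    live = reachable(refs.values())
    dead = sorted(set(objects) - live)
    for k in dead:
        del objects[k]
    return dead


removed = collect_garbage()
print(removed)
```

Note that blobA survives even though the orphaned tree referenced it, because it is still reachable from live trees.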
3.1.2) Rebasing a recently rebased branch the other way round = Fast-forward
Rebasing or merging a recently rebased or merged branch the other way round performs a Fast-Forward operation
3.2) Git Merge v/s Git Rebase
Git Merge = Preserves history exactly as it happened
- During a merge, Git only writes the merge commit; the earlier commits of both branches are left untouched.
- That can be confusing and Git log can be misleading: the log shows history as a single long timeline (one commit after the other), while merge history is really a graph rather than a line.
Git Rebase = Refactors history
- History looks simple and neat. Commits from both branches are arranged in a single timeline. A project with more Rebase operations generally looks more streamlined & clean wrt project history than a project with more Merge operations.
- In contrast to merge, rebase changes project history. Commits created in parallel on separate branches get arranged sequentially, in an order in which they didn't actually happen. So when in doubt Merge should be preferred over Rebase, as merges never lie.
Command : stree
SourceTree, to visualize git history
3.3) Git Tags <= 1.2.3.3
4) a Distributed Version Control
To share project across multiple systems
High Level Command : git clone <repo-address>
to get a copy of the project on the local system and show the files of only one branch (default master branch)
options : -b <branchname>, to check out the named branch instead of the default branch after cloning
- This command performs a Repo Copy: it gets the entire ‘.git/’ directory with its files, and rebuilds the project files from it
4.1) Repo Copy = With the ‘.git/’ directory, a copy of the project as well as its history becomes available
- Creates an empty directory for the project
- Copies only the ‘.git/’ directory (containing all objects from the object database) as the entire repo from repo-address into the empty directory. It doesn’t copy the project files themselves. Clone copies objects from Remote Repo to Local Repo.
- Checks out the ‘master’ branch, to rebuild the files of that branch in the working area (by default clone checks out the master branch; checking out another branch of the remote repo needs an extra option or command. The working area in Git can always be rebuilt from the content of the ‘.git/’ directory)
- Adds a few default lines to the configuration of the repository, .git/config , which lets it remember other Remote copies of the same repository. There can be as many Remotes as wanted.
- Defines a default remote and names it “origin”. It points to the URL the repo was cloned from
- Defines a local master branch which maps to the Remote master branch
- Automatically updates Remote references in .git/refs/remotes/origin/ ; some of the branches may be written to the .git/packed-refs file in a compact format
4.2) Multiple Repos
- As many clones as required are possible, synchronizing with each other
- All clones are equally good. Repos distributed on different systems are peers
- Git is not like Subversion or other traditional revision control systems, which need a centralized server.
- In practice it is still convenient for every developer to synchronize with one agreed repo. One specific clone can be important as a reference repo (like on GitHub).
4.2.1) Remote Branches = All Local & Remote branches are references to commits and Git tracks all of them. A “local” branch can correspond to the branch of the same name in “origin” (Remote)
References to Remote branches and the HEAD pointer of ‘origin’ are written in the .git/refs/remotes/ folder; tags in .git/refs/tags/
Remote branches are tracked exactly like Local branches
Some Local & Remote branches are written in .git/packed-refs in a compact format (a plain text file, but NOT MEANT TO BE EDITED by hand)
Git updates the Remote branch whenever the Local repo is synchronized with the remote
“local” branches are the same as “origin” branches as long as there are no new changes in “local” that have not been pushed
4.2.2) Git config
- .git/config file contains the list of Remotes (name & URL) and which Remote branch each Local branch is mapped to
- Git uses it to know the other repo(s) to synchronize with.
Low Level Command : git show-ref <branch-name>
to see which commit branch pointing at
4.3) Synchronizing Repos = Keep branches synchronized
- Gets the same objects onto all the clones. Copies missing objects from one repo to another.
High Level Command : git push [-f]
to send new objects and the updated branch to origin, and to update the remote branch references in the local repo to match the new state of origin
- Updates the commit of the Remote master branch & also updates the local ‘origin/master’ reference to the commit of the Local master branch
options : -f, to force the remote to take the new objects & to replace the Remote history with the pushed Local history
Force push may lead to losing commits made by others in ‘origin’, which will then be Garbage Collected
4.3.1) Updating Remote from Local = “fetch” and “merge” then “push” = “pull” then “push”
- Fetch changes from remote. Merge changes in local repo. Push the resulting commits. (Instead of using force push)
High Level Command : git fetch
to get new objects from Remote and to update current position of Remote branch in local repo
High Level Command : git pull
= git fetch -> git merge
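The “fetch” then “merge” then “push” cycle can be sketched with two dictionaries standing in for the local and remote repos. Commit ids are placeholders, only commit objects are modelled, and the rule that a normal push is accepted only when it is a fast-forward is a simplification of what the remote checks:

```python
def ancestors(repo: dict, key: str) -> set:
    """All commits reachable from 'key' through parent links, including itself."""
    seen, stack = set(), [key]
    while stack:
        k = stack.pop()
        if k not in seen:
            seen.add(k)
            stack.extend(repo["commits"][k])
    return seen


def fetch(local: dict, remote: dict, branch: str = "master") -> None:
    local["commits"].update(remote["commits"])  # copy missing objects
    local["refs"]["origin/" + branch] = remote["refs"][branch]


def push(local: dict, remote: dict, branch: str = "master", force: bool = False) -> bool:
    new, old = local["refs"][branch], remote["refs"][branch]
    to_send = ancestors(local, new)
    if not force and old not in to_send:
        return False  # rejected: remote has commits we don't build on
    remote["commits"].update({k: local["commits"][k] for k in to_send})
    remote["refs"][branch] = new
    local["refs"]["origin/" + branch] = new
    return True


# commits: id -> list of parent ids. Someone else pushed B; we committed C locally.
remote = {"commits": {"A": [], "B": ["A"]}, "refs": {"master": "B"}}
local = {"commits": {"A": [], "C": ["A"]}, "refs": {"master": "C", "origin/master": "A"}}

rejected = push(local, remote)       # fails: B is not part of our history
fetch(local, remote)                 # get B, update origin/master
local["commits"]["M"] = ["C", "B"]   # merge-commit of our C and their B
local["refs"]["master"] = "M"
accepted = push(local, remote)       # now a fast-forward for the remote
print(rejected, accepted, remote["refs"]["master"])
```

With force=True the first push would have been accepted and B would have become unreachable on the remote.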
4.4) Never rebase shared commits
If a “local” branch is rebased and pushed to “origin” while other users still have the same branch from before the rebase, their copies no longer match the rewritten history and conflicts follow
4.5) GitHub Features
Global users have no write access to others' repos. So one can clone from them, but can’t push to them
4.5.1) Fork = Remote Clone
- Creates an own copy of a project on GitHub, from someone else’s GitHub account into one's Own GitHub account.
- GitHub knows this connection. But Git doesn’t know the connection between the own copy-project “origin” and the other's original project “upstream”
A clone of the copy-project points to “origin” in the Own GitHub account.
To track changes of the original project, another remote pointing to “upstream” needs to be added manually to the (local) config file. This leads to a local project with Multiple Remotes : local changes are synchronized with “origin”, while changes from “upstream” can be pulled into the local project
4.5.2) Pull Requests (PR)
- With no write access to “upstream”, changes can’t be pushed to it.
- But GitHub provides a way to send a message (PR) to the “upstream” owner, asking them to pull the changes from “origin”