Git: merge with submodules
If you work in a project that uses Git as its version control system and and uses submodules, maybe you have faced a situation where you needed to change a submodule in a branch and needed to merge that branch into another.
For example: assuming that your project structure is the following, with two Python files (one being inside a subdirectory), and a subdirectory containing a library. In this example, that library is not a directory that was “copied” into the project. Instead, the library is an external project that was included here as a submodule:
$ tree
├── some_file.py
├── some_directory
│ └── another_file.py
└── library_inside_a_submodule
│ └── library_file.py
└── .gitmodules
Due to the fact that the library is included as a submodule, its version will be fixed in a commit until someone explicitly change it. That can be done this way, for example:
$ git -C library_inside_a_submodule fetch
$ git -C library_inside_a_submodule checkout <commit of the new version>
$ git add library_inside_a_submodule
$ git commit
If you have done that change in a branch of yours and you want that it be
applied to another branch, you’ll probably will do it by merging them. Assuming
that the other branch is called main
and the branch of yours is called
pr-branch
:
$ git checkout main
$ git merge pr-branch
Then, the commit graph look like the following, with the dotted arrows pointing to each version of the submodule:
As we can se in the picture, we want to know what commit of the submodule the merge commit will point to.
If the branch main
haven’t been changed while you was developing on
pr-branch
, then pr-branch
will be merged as expected, and the library will
be on the new version. Otherwise, the behavior isn’t trivial:
- maybe the version on
main
will be kept; - maybe the version on
pr-branch
will be kept; - maybe we have a conflict that Git won’t be able to solve without human intervention.
From this point on, I haven’t found any content that explained in a complete way what is the behavior of Git in that situation. In order to understand it, I needed to study Git in a deeper level, even reading the source code. Then, I wrote this post :-).
Brief explaination about objects
If you are familiar about what are objects in Git and how they work, you can skip to the next section. It is not my goal here to explaing deeply how they work, only what is enough to understand merge and submodules.
So, if you are not familiar about how Git works under the hood, I suggest you to read the sections 1.3 “What is Git”, 10.2 “Git Objects” and 10.3 “Git References” of the book Pro Git, available in several formats on https://git-scm.com/book/en/v2.
A brief tl;dr with what we need: the data of a repository (I’m calling “repository” the local repository, on each machine) are stored in units called objects. Here’s a very basic graphical representation of how the objects relate to each other:
One type of object is our old friend commit. Each commit stores the state of the entire project at the time it is created (not only the changes). That state is called snapshot, as it can be understood as a picture of the project at that time.
Other types of objects are tree and blob. A tree represents the state of a directory, being, basically, a list of its contents: subdirectories (each one with its content being represented by a tree), or files (each one with its content represented by a blob).
Each commit points to the tree that represents the project root at the state that the commit stores.
Each commit also points to it(s) parent(s) commit(s), if it/they exist(s):
- A initial commit doesn’t have a parent;
- A standard commit only has one parent;
- A merge commit has two or more parents, pointing to the commits that were merged in order to this commit be created. Usually, we merge only one branch into another, consequently, the resulting commit usually has two parents.
Every object is identified by its SHA-1 hash. Objects are unique and immutable. This means, for example, that:
- two identical files are represented by the same blob;
- two identical directories are represented by the same tree;
- a file that contains different contents on distinct commits is represented by different blobs, each one representing the content of the file on each commit;
A branch is not an object. It is only a reference, a pointer to a commit. Internally, it is only a file containing a hash of a commit.
Brief explanation about merge
Fast-forward
On the same example, if we run git merge pr-branch
, we can find ourselves in a
situation where pr-branch
points to a descendent commit of the commit that
main
points to. In this case, by default Git doesn’t do a true merge,
performing a fast-forward instead. That means that Git will simply make main
point to the same commit as pr-branch
.
The situation before the fast-forward looked like the following, with the blue graph representing the commits of the project, and the red graph representing the commits of the submodule:
After the fast-forward:
This case doesn’t need further explanation about the main point of this post, as
the library will obviously be in the same version as pr-branch
after the
fast-forward, since main
will point to the same commit as pr-branch
.
True merge
A true merge happens if the commit that pr-branch
points to is not a
descendant of the commit that main
points to. Thus, the merge commit will be
created with two parents, just like I said before. That behavior can also be
forced even when it’s possible to perform a fast-forward, using the flag
--no-ff
.
git merge
may have distinct behaviors depending on the selected stragy. By
default on the newer versions of Git, it uses ort
as merge strategy. On older
versions, the default merge strategy is recursive
.
Hence, this post will describe how ort
works and it is also applicable to
recursive
, however, it is not applicable to the other ones.
Three-way merge
The decision of what is chosen to be kept in the merge commit is made by an algorithm called three-way merge. It is based on three commits: the two commits pointed by the branches that we are merging, and the best common ancestor between them.
By “best common ancestor” we mean: a common ancestor between two commits X is better than another common ancestor Y if X is descendant of Y. The “best of all” will be used. More information in git merge-base manpage.
In the following three examples, the best common ancestor of two branches A and B is the yellow commit:
Let’s denote by A
and B
the two commits pointed by the branches that we want
to merge, and By O
the best common ancestor between them. Here, a file may
have, then, up to three distinct versions, one in A
, one B
and one in O
.
In order decide which version will be used, the three-way merge algorithm follow this rule:
- If the file has the same content in both
A
andB
, it will be kept in the merge - Otherwise:
- If the file has the same content in
A
and inO
, the content inB
will be used - If the file has the same content in
B
and inO
, the content inA
will be used - Otherwise (the contents of the file diverge in
A
,B
andO
): conflict.
- If the file has the same content in
We can see that in the following image. The inner circles represent the content of a file on each commit:
The comparison between those contents isn’t made using the entire file: Git compares the hash of the contents of each state, which is enough do decide whether they are equal or not (remember: if the file is the same on different commits, its content is the same, thus it is represented by the same blob).
If there is a conflict and it’s a plaintext file, then Git, by default, performs
a three-way merge internally on the file. This way, for example, if the change
from O
to A
was introduced in a different place of the file than the change
between O
and B
, both are preserved. If they were made in the same place,
then Git adds the conflict marks (<<<<<<<
, =======
e >>>>>>>
) in order to
be resolved by the user.
Binary files are not resolved. The user must choose which one will be kept in the merge commit.
Brief explanation about submodules
Well, if you are reading this, probably you won’t need an introduction to submodules. However, it is worth to show how they are represented under the hood.
We saw that each tree represents the content of a directory. With the command
git ls-tree
we can inspect the content of a tree. If we run git ls-tree HEAD
we can check the content of the tree of the current commit. In our example, it
would be something like that:
$ git ls-tree HEAD
100644 blob 123abc456def123abc456def123abc456def9999 .gitmodules
100644 blob 1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1234 some_file.py
040000 tree aaaaabbbbbcccccdddddeeeeefffff1111122222 some_directory
160000 commit fedcba0123456789fedcba012345678912345678 library_inside_a_submodule
The hashes in the entries of the files some_file.py
and .gitmodules
point to
the blobs that store its respective contents. The hash in the entry of
some_directory
points to the tree which stores the content of that directory.
However, in the line of the library_inside_a_submodule
, the corresponding hash
is exactly the hash of the commit of the submodule containing the current version
of the library. If we change the version of that library, that hash
changes. Consequently, the tree that represent the root directory of the project
will be another object, or in other words, other snapshot.
Graphically, the submodules would be like that:
Besides that, we also have the file .gitmodules
, that stores properties of the
submodules, for example, their directories (path
) and the URL of the
repository from where it was cloned (url
). More information about
.gitmodules
can be found on its manpage.
Three-way merge of submodules
The three-way merge algorithm is also used when we merge two commits that contains submodules. It is powered by the same mechanism that performs the three-way merge of files. However, instead of comparing hashes of blobs, it compares hashes of the commits of submodules:
- If the submodule points to the same commit in both
A
andB
, that commit will be kept on the merge commit; - Otherwise:
- If the submodule points to the same commit in both
A
andO
, the commit that it points to inB
will be used in the merge commit; - If the submodule points to the same commit in both
B
andO
, the commit that it points to inA
will be used in the merge commit; - Otherwise (the commits that the submodule points to in
A
,B
andO
diverge): conflict
- If the submodule points to the same commit in both
Everything is ok until here, however things can get complicated when we have a conflict…
Resolving submodule conflicts
If we have conflicts between submodules, depending on the situation Git tries to
resolve automatically: if the commit of the submodule in A
is descendent of the
commit of the submodule in B
, the commit of the submodule in A
will be
used. The opposite is also true: if the commit of the submodule in B
is
descendant of the commit of the submodule in A
, then the commit of the
submodule in B
will be used. That behavior you can see
here, on Git source code.
In other words, a fast-forward is performed in the submodule.
Graphically, before merging A and B, the situation was like that:
After the merge that performs a fast-forward on the submodule, the situation is like that:
If it’s not possible to perform that fast-forward, then Git tells us that we have a conflict and the user should resolve it manually, like in this situation:
Even so, if one of those commits is not descendent of the other but there is a merge commit that are descendant of both, then that merge commit will be suggested to the user as a solution of the conflict. Then the user may accept the suggestion or not. That behavior can be seen here, on Git source code.
Note that those submodule conflict resolution solutions are only possible if we
locally have the corresponding commit objects of the submodule. If they are not
present (for example, due to the lack of the flag --recursive
when calling
git clone
, or due to not running a git submodule update
), then Git will not
be capable of resolve the conflict and then will indicate it.
How about GitHub and GitLab?
libgit2
is a C library that provides Git functionality. According to its
README,
both GitHub and GitLab use libgit2
. That library doesn’t try to resolve
submodule conflicts as Git does, as it can be seen
here, on libgit2 source-code,
and it is also valid for GitHub and GitLab.
Here’s how we can see that:
$ git checkout main
$ git submodule update --init
$ git branch pr-branch
# Adding a commit to the branch main
$ git -C library_inside_a_submodule checkout some_commit~ # please, note the ~
$ git add library_inside_a_submodule
$ git commit
# Adding a commit to the branch pr-branch
$ git checkout pr-branch
$ git submodule update --init
$ git -C library_inside_a_submodule checkout some_commit # descendant of the commit in main
$ git add library_inside_a_submodule
$ git commit
# Git push, in order to make PR/MR
$ git push origin pr-branch # assuming your remote is called "origin"
Now, we are in a situation like that:
Locally, in that case, if we run git checkout main && git merge pr-branch
it
would work, as Git would perform a fast-forward of the submodule. However, if
you open a Pull Request (GitHub) or Merge Request (GitLab) asking for merge
pr-branch
into main
, a submodule conflict will be indicated.
How we can solve it?
If you are in a situation like that, you can do this:
$ git fetch origin main # assuming your remote is called "origin"
$ git checkout pr-branch
$ git submodule update --init
$ git merge origin/main
$ git push origin pr-branch # assuming your remote is called "origin"
Then your commit graph will look like that:
Then, the PR/MR will not be in conflict with the branch main
. So, why does it
work? Well, we merged main
into pr-branch
, and locally it succeeded, as Git
could fast-forward the submodule.
After we git push
it:
- If no one commit any change to the
main
during that operation, thenpr-branch
will be descendant ofmain
, and it’s possible to perform a fast-forward; - If someone changed the branch
main
during that operation, then the formermain
(the one the yougit fetch
ed) would be the best common ancestor between the newermain
(with the changes introduced by the other person) andpr-branch
. This way, if the newermain
didn’t change the submodule since the oldermain
, then the commit of the submodule frompr-branch
will be chosen. It would be also true if you are not using fast-forward on GitHub or GitLab.
Conclusion
When we run git merge pr-branch
, Git tries to merge the submodules using the
three-way merge algorithm. If it is not possible, it tries to fast-forward the
submodule to the descendant commit, if there is one that is pointed by one of
the branches. GitHub and GitLab don’t perform a fast-forward, so we need to merge
locally, resolving the conflict automatically or manually, in a way that allows
GitHub or GitLab perform the three-way merge without conflicts.