Merging Sets

Page last updated

Some complementary code that implements the following logic in TypeScript can be found here.

I find the problem set of representing conflicts fascinating. I would love to have reliable structures and an algebra around it that would enable us to safely reason about merges in the abstract.

Something that has been fun to play with has been trying to define a merge operation on sets from some sort of first principles. This has limited immediate use for VCS like GitButler, but might have a thread that can be followed to a fuller representation.

In the following examples, I’m concerned with defining a basic merge with a base and two sides that produces an output set. I’ll be drawing examples in the following shape where I want to find the merged result of BB and CC. I’m imagining the “contents” of each of these nodes as a set.

BAMC \begin{array}{ccccc} & & B & & \\ & \nearrow & & \searrow & \\ A & & & & M\\ & \searrow & & \nearrow & \\ & & C & & \\ \end{array}

The rules for a merge when only one side has made a change is pretty unambiguous, where we are only interested in the changed version. In this exploration, if the same change has been made to both sides, I’m taking one of the sides. This is generally the most practical approach since presenting same side conflicts to the user is generally impractical. It should however be noted, that it’s this absorption that is what can result in the uncherrypick identity problem.

If A=B=CA = B = C, then MM = AAB={a}A={a}M={a}C={a} \begin{array}{ccccc} & & B = \{a\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{a\} \\ & \searrow & & \nearrow & \\ & & C = \{a\} & & \\ \end{array}
If A=CA = C and ABA \neq B, then M=BM = B

BB adds an item to its set:

B={a,b}A={a}M={a,b}C={a} \begin{array}{ccccc} & & B = \{a, b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{a, b\} \\ & \searrow & & \nearrow & \\ & & C = \{a\} & & \\ \end{array}

BB removes an item from its set:

B={}A={a}M={}C={a} \begin{array}{ccccc} & & B = \{\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{\} \\ & \searrow & & \nearrow & \\ & & C = \{a\} & & \\ \end{array}

BB replaces an item in its set:

B={b}A={a}M={b}C={a} \begin{array}{ccccc} & & B = \{b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{b\} \\ & \searrow & & \nearrow & \\ & & C = \{a\} & & \\ \end{array}
If C=BC = B, then M=CM = C

BB and CC both add the same items to their sets.

B={a,b}A={a}M={a,b}C={a,b} \begin{array}{ccccc} & & B = \{a, b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{a, b\} \\ & \searrow & & \nearrow & \\ & & C = \{a, b\} & & \\ \end{array}

BB and CC both remove the same items from their sets.

B={}A={a}M={}C={} \begin{array}{ccccc} & & B = \{\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{\} \\ & \searrow & & \nearrow & \\ & & C = \{\} & & \\ \end{array}

BB and CC both replace the same items from their sets.

B={b}A={a}M={b}C={b} \begin{array}{ccccc} & & B = \{b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{b\} \\ & \searrow & & \nearrow & \\ & & C = \{b\} & & \\ \end{array}

By extension, if A=B=CA = B = C, then M=AM = A

B={a}A={a}M={a}C={a} \begin{array}{ccccc} & & B = \{a\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{a\} \\ & \searrow & & \nearrow & \\ & & C = \{a\} & & \\ \end{array}

It’s a little more interesting what MM should be when ABA \neq B and ACA \neq C, but it in some ways it is also kind of a natural extension of the A=CA = C and ABA \neq B cases. I’ve listed a reasonably complete set of permutations with what I think would be the expected results.

Scenarios when ABA \neq B and ACA \neq C

BB adds an item to its set, CC adds an item to its set:

B={a,b}A={a}M={a,b,c}C={a,c} \begin{array}{ccccc} & & B = \{a, b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{a, b, c\} \\ & \searrow & & \nearrow & \\ & & C = \{a, c\} & & \\ \end{array}

BB adds an item to its set, CC removes an item from its set:

B={a,b}A={a}M={b}C={} \begin{array}{ccccc} & & B = \{a, b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{b\} \\ & \searrow & & \nearrow & \\ & & C = \{\} & & \\ \end{array}

BB adds an item to its set, CC replaces its item with that same item (absorption case):

B={a,b}A={a}M={b}C={b} \begin{array}{ccccc} & & B = \{a, b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{b\} \\ & \searrow & & \nearrow & \\ & & C = \{b\} & & \\ \end{array}

BB and CC each remove an item from the set.

B={a}A={a,b}M={}C={b} \begin{array}{ccccc} & & B = \{a\} & & \\ & \nearrow & & \searrow & \\ A = \{a, b\} & & & & M = \{\} \\ & \searrow & & \nearrow & \\ & & C = \{b\} & & \\ \end{array}

BB and CC both replace the base item with different items, absorbing the shared removal.

B={b}A={a}M={b,c}C={c} \begin{array}{ccccc} & & B = \{b\} & & \\ & \nearrow & & \searrow & \\ A = \{a\} & & & & M = \{b, c\} \\ & \searrow & & \nearrow & \\ & & C = \{c\} & & \\ \end{array}

While we’re listing our dreams, we can also think about what some desirable properties of a merge function would be:

  • m(B,C)=m(C,B)m(B, C) = m(C, B)
  • With AA as a base, BB, CC, and DD as children,
    • m(m(B,C),D)=m(m(B,D),C)m(m(B, C), D) = m(m(B, D), C)
    • m(B,C,D)=m(m(B,C),D)m(B, C, D) = m(m(B, C), D)

Deriving a merge from deltas

Another way to think about what MM should contain is in terms of the set of deltas that when applied together constitute the merged result. This looks like:

B={δOA,δAB}A={δOA}M={δOA,δAB,δAC}C={δOA,δAC} \begin{array}{ccccc} & & B = \{\delta_O^A, \delta_A^B\} & & \\ & \nearrow & & \searrow & \\ A = \{\delta_O^A\} & & & & M = \{\delta_O^A, \delta_A^B, \delta_A^C\} \\ & \searrow & & \nearrow & \\ & & C = \{\delta_O^A, \delta_A^C\} & & \\ \end{array}

With our sets, what does a delta look like? A delta needs to be able to describe how to get from an arbitrary set to another arbitrary set. Perhaps I’ve been brainwashed by working with text diffing for too long, but the first thing that came to me was to define a delta as a series of additions and removals.

I started by defining the set RemovalsRemovals that represents the elements that would need to be removed in order to go from AA to BB, and AdditionsAdditions which represents all the elements that would need to be added in order to go from AA to BB. Together they make up a complete delta.

We can define RemovalsRemovals as ABA \sand B', and AdditionsAdditions as BAB \sand A'.

With our delta, we can then concern ourselves with applying it to a set. We can do so by doing (ARemovals)Additions(A \sand Removals') \sor Additions. When expanded, we would expect to see that applying δAB\delta_A^B to AA would simplify down to BB, which is indeed what we get.

(ARemovals)Additions=(A(AB))(BA)=(A(AB))(BA)=(AA)(AB)(BA)=(AB)(BA)=B(AA)=B\begin{align*} & (A \sand Removals') \sor Additions\\ = & (A \sand (A \sand B')') \sor (B \sand A')\\ = & (A \sand (A' \sor B)) \sor (B \sand A')\\ = & (A \sand A') \sor (A \sand B) \sor (B \sand A')\\ = & (A \sand B) \sor (B \sand A')\\ = & B \sand (A \sor A')\\ = & B \end{align*}

Merging

I now have a way of describing δAB\delta_A^B and δAC\delta_A^C, so how can I apply both of them to AA simultaneously?

One interesting property of these two deltas is that the union of the removals and the union of the additions are mutually exclusive.

(RemovalsBRemovalsC)(AdditionsBAdditionsC)=((AB)(AC))((AB)(AC))=(A(BC))(A(BC))=(BC)(BC)=\begin{align*} & (Removals_B \sor Removals_C) \sand (Additions_B \sor Additions_C)\\ = & ((A \sand B') \sor (A \sand C')) \sand ((A' \sand B) \sor (A' \sand C))\\ = & (A \sand (B' \sor C')) \sand (A' \sand (B \sor C))\\ = & \empty \sand (B' \sor C') \sand (B \sor C)\\ = & \empty\\ \end{align*}

Despite this fact, my first attempt was to still try and model the merge by first combining two deltas, and applying them together with (A(RemovalsBRemovalsC))AdditionsBAdditionsC(A \sand (Removals_B \sor Removals_C)') \sor Additions_B \sor Additions_C. This simplifies down quite nicely and does fulfil our listed scenarios.

(A(RemovalsBRemovalsC))AdditionsBAdditionsC=(A((AB)(AC)))(AB)(AC)=(A(AB)(AC))(AB)(AC)=(BC)(AB)(AC)=(BC)(A(BC))\begin{align*} & (A \sand (Removals_B \sor Removals_C)') \sor Additions_B \sor Additions_C\\ = & (A \sand ((A \sand B') \sor (A \sand C'))') \sor (A' \sand B) \sor (A' \sand C)\\ = & (A \sand (A' \sor B) \sand (A' \sor C)) \sor (A' \sand B) \sor (A' \sand C)\\ = & (B \sand C) \sor (A' \sand B) \sor (A' \sand C)\\ = & (B \sand C) \sor (A' \sand (B \sor C))\\ \end{align*}

With the two sets being mutually exclusive, it means that applying the two deltas sequentially to AA, or simply applying δAC\delta_A^C to BB, should provide the same result, which it indeed does.

(BRemovalsC)AdditionsC=(B(AC))(AC)=(B(AC))(AC)=(BA)(BC)(AC)=(BC)(A(BC))\begin{align*} & (B \sand Removals_C') \sor Additions_C\\ = & (B \sand (A \sand C')') \sor (A' \sand C)\\ = & (B \sand (A' \sor C)) \sor (A' \sand C)\\ = & (B \sand A') \sor (B \sand C) \sor (A' \sand C)\\ = & (B \sand C) \sor (A' \sand (B \sor C))\\ \end{align*}

With our simplified expression, we can see that our first requirement m(B,C)=m(C,B)m(B, C) = m(C, B) holds due to (BC)(B \sand C) and (BC)(B \sor C) both commuting. Further, due to the mutually exclusive nature of the removals and additions, m(m(B,C),D)=m(m(B,D),C)=m(B,C,D)m(m(B, C), D) = m(m(B, D), C) = m(B, C, D) also falls out.

Absorption and Badmerge

Something that is very apparent is that we run into the uncherrypick identity issue I described before. Unless we change the expected outcomes or have a richer underlying representation of a set, we’re however not going to avoid this issue.

It’s not quite all doom and gloom though. What is very nice is it becomes very clear where the terms crop up! The overlap between any of the Addition terms or between any of the Removal terms will result in absorption.

This approach is also susceptible to the badmerge scenario.

I am hopeful that there might be some structure that operates similarly to a set while still answering some of the open questions that we have, which is part of what makes exploring topics like this interesting.

Potential use cases for this set merge

VCSes like Git and Mercurial are both susceptible to absorption and the badmerge scenarios, but their occurrence in practice is rare, so using systems that have these properties is not unconscionable.

One use case I imagine benefiting from exploring using sets like this would be local-first software. LWW-Sets are a common favourite in the space, but can often result in a loss of work if an unfavourable race is won. An interesting alternative would be exploring whether we can represent these conflicts as a set of potential options.

We can imagine that a user has defined their favourite colour as “Green”, which we store as the set {Green}\{Green\}. They then change it to {Blue}\{Blue\} on their phone, and {Red}\{Red\} on their laptop. When they both merge histories, they end up with the set {Blue,Red}\{Blue, Red\}. The set would get presented as the user as some conflict that they ought to resolve. The benefit of doing this is that if yet another device had a different unsynced change, we can still go ahead and merge that as well, so we don’t need to stop and actively resolve any conflicts.

The behaviour of the set could also be extended to a mergeable map structure for storing collections of unordered data, though, I’ll pursue that further another time.

Acknowledgements

Thanks to my Dad for going back and forth on some of these ideas with me. Also thanks to the folks at the Mercurial sprint where some of these ideas were seeded.