stablerange: add some documentation about the general concept
authorPierre-Yves David <pierre-yves.david@octobus.net>
Sun, 08 Sep 2019 17:49:26 +0200
changeset 4836 e2465969958e
parent 4835 7c38a4353bb3
child 4837 b59231359479
stablerange: add some documentation about the general concept
hgext3rd/evolve/stablerange.py
--- a/hgext3rd/evolve/stablerange.py	Sun Sep 08 17:47:37 2019 +0200
+++ b/hgext3rd/evolve/stablerange.py	Sun Sep 08 17:49:26 2019 +0200
@@ -6,6 +6,226 @@
 #
 # This software may be used and distributed according to the terms of the
 # GNU General Public License version 2 or any later version.
+"""stable range
+
+General Goals and Properties
+---------------------------
+
+Stable-ranges get useful when some logic needs a recursive way to slice the
+history of a repository in smaller and smaller group of revisions. Here is
+example of such use cases:
+
+* **bundle caching:**
+  With an easy way to slice any subsets of history into stable-ranges, we
+  can cache a small number of bundles covering these ranges and reuse them for
+  other pull operations. Even if the pull operation have different boudaries.
+
+* **metadata discovery:**
+  With a simple way to recursively look at smaller and smaller ranges, an
+  algorithm can do fine-grained discovery of area of history where some mutable
+  metadata differ from one repository to another. Such meta data can be
+  obsolescence markers, CI status, lightweight stag, etc...
+
+To fix these use cases best, stable-ranges need some important properties:
+
+* the total number of ranges needed to cover a full repository is well bounded.
+* the minimal number of ranges to cover an arbitrary subset of the history is well bounded
+* for the same section of history, the range will be the same on any
+  repositories,
+* the ranges are cheap to compute iteratively, each new revisions re-uses the
+  ranges previous revisions uses.
+
+Simple introduction to the Concepts
+-----------------------------------
+
+To keep things simple, let us look at the issue on a linear history::
+
+  A -> B -> C -> D -> E -> F -> G -> H
+
+To make sure we have range that cover each part of the history with a good
+granularity we use some binary recursion. The standard stable range will be:
+
+ [A -> B -> C -> D -> E -> F -> G -> H] size 8
+ [A -> B -> C -> D]  [E -> F -> G -> H] size 4
+ [A -> B]  [C -> D]  [E -> F]  [G -> H] size 2
+ [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H] size 1
+
+Well bounded total number of ranges:
+````````````````````````````````````
+
+This binary slicing make sure we keep the total number of stable ranges under control.
+
+As you can see, we have N size 1 ranges. They are trivial and we don't care
+about them. Then we have: N/2 size 2 ranges + N/4 size 4 ranges + N/8 size 8
+ranges, etc... So a total of about "length(repo)" standard ranges.
+
+
+Well bounded number of range to cover a subset:
+```````````````````````````````````````````````
+
+Any subset of the history can be expressed with this standard ranges.
+
+For example, [A, F] subset, can be covered with 2 ranges::
+
+  [A ->[B -> C -> D]  [E -> F]
+
+A less strivial example [B, F], still requires a small number of ranges (3)::
+
+  [B]  [C -> D]  [E -> F]
+
+In practice, any subset can be expressed in at most "2 x log2(length(subset))"
+stable range, well bounded value.
+
+Cheap incremental updates
+`````````````````````````
+
+The scheme describe above result in 2N subranges for a repository is size N. We
+do not want to have to recompute these 2N stable-ranges whenever a new revision
+is added to the repository. To achieve these, the stable-ranges are defined by
+**fixed boundaries** that are independant from the total size of the
+repository. Here is how it looks like in our example.
+
+We start with a repository having only [A, F]. Notice how we still creates
+power of two sized stable range::
+
+  [A -> B -> C -> D]
+  [A -> B]  [C -> D]  [E -> F]
+  [A]  [B]  [C]  [D]  [E]  [F]
+
+This we simply adds a new revision G, we reuse more the range we already have::
+
+  [A -> B -> C -> D]
+  [A -> B]  [C -> D]  [E -> F]
+  [A]  [B]  [C]  [D]  [E]  [F]  [G]
+
+Adding H is a bigger even as we read a new boundary.
+
+  [A -> B -> C -> D -> E -> F -> G -> H]
+  [A -> B -> C -> D]  [E -> F -> G -> H]
+  [A -> B]  [C -> D]  [E -> F]  [G -> H]
+  [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
+
+At most, adding a new revision `R` will introduces `log2(length(::R))` new
+stable ranges.
+
+More advanced elements
+----------------------
+
+Of course, the history of repository is not as simple as our linear example. So
+how do we deal with the branching and merging? To do so, we leverage the
+"stable sort" algorithm defined in the `stablesort.py` module. To define the
+stable range that compose a set of revison `::R`, we linearize the space by
+sorting it. The stable sort algorithm has two important property:
+
+First, it give the same result on different repository. So we can use it to
+algorithm involving multiple peers.
+
+Second, in case of merge, it reuse the same order as the parents as much as possible.
+This is important to keep reusing existing stable range as the repository grow.
+
+How are ranges defined?
+```````````````````````
+
+To keep things small, and simple, a stable range always contains the final part
+of a `stablesort(::R)` set of revision. It is defined by two items:
+
+* its head revision, the R in `stablesort(::R)`
+* the size of that range... well almost, for implementation reason, it uses the
+  index of the first included item. Or in other word, the number of excluded
+  initial item in `stablesort(::R)`.
+
+Lets look at a practical case. In our initial example, `[A, B, C, D, E, F, G,
+H]` is H-0; `[E, F, G, H]` is H-4; `[G, H]` is H-6 and `[H]` is H-7.
+
+Let us look at a non linar graph::
+
+ A - B - C - E
+       |    /
+        -D
+
+and assume that `stablesort(::E) = [A, B, C, D, E]`. Then `[A, B, C]` is C-0,
+`[A, B, D]` is D-0; `[D, E]` is E-3, `[E]` is E-4, etc...
+
+Slicing in a non linear context
+```````````````````````````````
+
+Branching can also affect the way we slice things.
+
+The small example above offers a simple example. For a size 5 (starting at the
+root), standard slicing will want a size 4 part and size 1 part. So, in a
+simple linear space `[A, B, C, D, E]` would be sliced as `[A, B, C, D] + [E]`.
+However, in our non-linear case, `[A, B, C, D]` has two heads (C and D) and
+cannot be expressed with a single range. As a result the range will be sliced
+into more sub ranges::
+
+    stdslice(A-0) = [A, B, C] + [D] + [E] = C-0 + D-2 + A-4
+
+Yet, this does not mean ranges containing a merge will always result in slicing
+with many revision. the sub ranges might also silently contains them. Let us
+look at an exemple::
+
+  A - B - C - D - E --- G - H
+        |             /
+         ---------- F
+
+with::
+
+  `stablesort(::H) == [A, B, C, D, E, F, G, H]`
+
+then::
+
+  stdslice(H-0) = [A, B, C, D] + [E, F, G, H] = D-0 + H-4
+
+As a result the non linearity will increase the number of subranges involved,
+but in practice the impact stay limited.
+
+The total number of standard subranges stay under control with about
+`O(log2(N))` new stable range introduced for each new revision.
+
+In addition, it is worth nothing that the head of the extra ranges we have to
+use will match the destination of the "jump" cached by the stablesort
+algorithm. So, all this slicing can usually be done without iterating over the
+stable sorted revision.
+
+Caching Strategy
+----------------
+
+The current caching strategy use a very space inefficient sqlite database.
+testing show it often take 100x more space than what basic binary storage would
+take. The sqlite storage was very useful at the proof of concept stage.
+
+Since all new stable-ranges introduced by a revision R will be "R headed". So we
+could easily store their standard subranges alongside the revision information
+to reuse the existing revision index.
+
+Subrange informatin can be efficiently stored, a naive approach storing all
+stable ranges and their subranges would requires just 2 integer per range + 2
+integer for any extra sub-ranges other than the first and last ones.
+
+We can probably push efficiency further by taking advantage of the large
+overlap in subranges for one non-merge revision to the next. This is probably a
+premature optimisation until we start getting actual result for a naive binary
+storage.
+
+To use this at a large scale, it would be important to compute these data at
+commit time and to exchange them alongside the revision over the network. This
+is similar to what we do for other cached data.
+
+Performance
+-----------
+
+The current implementation has not been especially optimized for performance.
+The goal was mostly to get the order of magnitude of the algorithm complexity.
+The result are convincing: medium repository get a full cache warming in a
+couple of seconds and even very large and branchy repository get a fully warmed
+in the order of tens of minutes. We do not observes a complexity explosion
+making the algorithm unusable of large repositories.
+
+A better cache implementation combined with an optimized version of the
+algorithm should give much faster performance. Combined with commit-time
+computation and exchange over the network, the overall impact of this should be
+invisible to the user.
+"""
 
 import abc
 import functools