# HG changeset patch
# User Pierre-Yves David
# Date 1567957766 -7200
# Node ID e2465969958e57dbe7ddde2f0cc6cc79813f8916
# Parent 7c38a4353bb39a3905cb2d614c97bfef2b35abf9
stablerange: add some documentation about the general concept

diff -r 7c38a4353bb3 -r e2465969958e hgext3rd/evolve/stablerange.py
--- a/hgext3rd/evolve/stablerange.py Sun Sep 08 17:47:37 2019 +0200
+++ b/hgext3rd/evolve/stablerange.py Sun Sep 08 17:49:26 2019 +0200
@@ -6,6 +6,226 @@
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
"""stable range

General Goals and Properties
----------------------------

Stable ranges are useful when some logic needs a recursive way to slice the
history of a repository into smaller and smaller groups of revisions. Here are
examples of such use cases:

* **bundle caching:**
  With an easy way to slice any subset of history into stable ranges, we can
  cache a small number of bundles covering these ranges and reuse them for
  other pull operations, even if the pull operations have different
  boundaries.

* **metadata discovery:**
  With a simple way to recursively look at smaller and smaller ranges, an
  algorithm can do fine-grained discovery of the areas of history where some
  mutable metadata differs from one repository to another. Such metadata can
  be obsolescence markers, CI status, lightweight tags, etc.

To serve these use cases well, stable ranges need some important properties:

* the total number of ranges needed to cover a full repository is well
  bounded;
* the minimal number of ranges needed to cover an arbitrary subset of the
  history is well bounded;
* for the same section of history, the ranges are the same on any repository;
* the ranges are cheap to compute iteratively: each new revision reuses the
  ranges previous revisions use.

Simple introduction to the Concepts
-----------------------------------

To keep things simple, let us look at the issue on a linear history::

    A -> B -> C -> D -> E -> F -> G -> H

To make sure we have ranges covering each part of the history at a good
granularity, we use binary recursion. The standard stable ranges will be::

    [A -> B -> C -> D -> E -> F -> G -> H]    size 8
    [A -> B -> C -> D] [E -> F -> G -> H]     size 4
    [A -> B] [C -> D]  [E -> F]  [G -> H]     size 2
    [A] [B]  [C]  [D]  [E]  [F]  [G]  [H]     size 1

Well bounded total number of ranges
```````````````````````````````````

This binary slicing makes sure we keep the total number of stable ranges under
control.

As you can see, we have N size 1 ranges. They are trivial and we do not care
about them. Then we have N/2 size 2 ranges + N/4 size 4 ranges + N/8 size 8
ranges, etc. So a total of about "length(repo)" standard ranges.

Well bounded number of ranges to cover a subset
```````````````````````````````````````````````

Any subset of the history can be expressed with these standard ranges.

For example, the [A, F] subset can be covered with 2 ranges::

    [A -> B -> C -> D] [E -> F]

A less trivial example, [B, F], still requires only a small number of ranges
(3)::

    [B] [C -> D] [E -> F]

In practice, any subset can be expressed with at most "2 x log2(length(subset))"
stable ranges, a well bounded value.
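To make this bound concrete, here is a minimal sketch of the decomposition,
assuming a purely linear history with revisions numbered from 0 and standard
ranges aligned on power-of-two boundaries; ``standard_slices`` is an
illustrative name, not part of this module's actual API::

    def standard_slices(start, stop):
        # Greedily emit the largest aligned power-of-two block
        # [start, start + size) such that start is a multiple of size
        # and the block still fits inside [start, stop).
        # (Hypothetical helper, for illustration only.)
        ranges = []
        while start < stop:
            size = 1
            while start % (size * 2) == 0 and start + size * 2 <= stop:
                size *= 2
            ranges.append((start, start + size))
            start += size
        return ranges

    # With A..H numbered 0..7:
    standard_slices(0, 6)   # [A, F] -> [(0, 4), (4, 6)]          (2 ranges)
    standard_slices(1, 6)   # [B, F] -> [(1, 2), (2, 4), (4, 6)]  (3 ranges)

Block sizes grow while the start index stays aligned and shrink toward the end
of the interval, which is what keeps any subset within the "2 x log2" bound
mentioned above.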
Cheap incremental updates
`````````````````````````

The scheme described above results in about 2N subranges for a repository of
size N. We do not want to recompute these 2N stable ranges whenever a new
revision is added to the repository. To achieve this, the stable ranges are
defined by **fixed boundaries** that are independent of the total size of the
repository. Here is how it looks in our example.

We start with a repository containing only [A, F]. Notice how we still create
power-of-two sized stable ranges::

    [A -> B -> C -> D]
    [A -> B] [C -> D] [E -> F]
    [A] [B]  [C]  [D] [E] [F]

When we simply add a new revision G, we reuse most of the ranges we already
have::

    [A -> B -> C -> D]
    [A -> B] [C -> D] [E -> F]
    [A] [B]  [C]  [D] [E] [F] [G]

Adding H is a bigger event, as we reach a new power-of-two boundary::

    [A -> B -> C -> D -> E -> F -> G -> H]
    [A -> B -> C -> D] [E -> F -> G -> H]
    [A -> B] [C -> D]  [E -> F]  [G -> H]
    [A] [B]  [C]  [D]  [E]  [F]  [G]  [H]

At most, adding a new revision `R` will introduce `log2(length(::R))` new
stable ranges.

More advanced elements
----------------------

Of course, the history of a repository is not as simple as our linear example,
so how do we deal with branching and merging? To do so, we leverage the
"stable sort" algorithm defined in the `stablesort.py` module. To define the
stable ranges that compose a set of revisions `::R`, we linearize the space by
sorting it. The stable sort algorithm has two important properties:

First, it gives the same result on different repositories, so we can use it in
algorithms involving multiple peers.

Second, in case of a merge, it reuses the order of the parents as much as
possible. This is important to keep reusing existing stable ranges as the
repository grows.

How are ranges defined?
```````````````````````

To keep things small and simple, a stable range always contains the final part
of a `stablesort(::R)` set of revisions. It is defined by two items:

* its head revision, the `R` in `stablesort(::R)`;
* the size of that range... well, almost: for implementation reasons, it uses
  the index of the first included item, in other words the number of excluded
  initial items in `stablesort(::R)`.

Let us look at a practical case. In our initial example, `[A, B, C, D, E, F,
G, H]` is H-0; `[E, F, G, H]` is H-4; `[G, H]` is H-6 and `[H]` is H-7.

Let us now look at a non-linear graph::

    A - B - C - E
        |      /
        '- D -'

and assume that `stablesort(::E) = [A, B, C, D, E]`. Then `[A, B, C]` is C-0,
`[A, B, D]` is D-0, `[D, E]` is E-3, `[E]` is E-4, etc.
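Expressed as code, this two-item identification is tiny. Here is a hedged
sketch, assuming `stablesort(::head)` is available as a plain list;
``range_revisions`` is a hypothetical helper used for illustration, not this
module's actual API::

    def range_revisions(stablesorted, index):
        # The range "head-index" keeps the tail of stablesort(::head),
        # skipping the first `index` revisions (hypothetical helper).
        return stablesorted[index:]

    # With the graph above, stablesort(::E) == ['A', 'B', 'C', 'D', 'E']:
    range_revisions(['A', 'B', 'C', 'D', 'E'], 3)   # E-3 -> ['D', 'E']
    range_revisions(['A', 'B', 'C', 'D', 'E'], 4)   # E-4 -> ['E']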
Slicing in a non-linear context
```````````````````````````````

Branching can also affect the way we slice things.

The small example above offers a simple illustration. For a size 5 range
starting at the root, standard slicing will want a size 4 part and a size 1
part. So, in a simple linear space, `[A, B, C, D, E]` would be sliced as
`[A, B, C, D] + [E]`. However, in our non-linear case, `[A, B, C, D]` has two
heads (C and D) and cannot be expressed with a single range. As a result the
range will be sliced into more subranges::

    stdslice(E-0) = [A, B, C] + [D] + [E] = C-0 + D-2 + E-4

Yet, this does not mean that ranges containing a merge will always be sliced
into many subranges; the subranges might also silently contain the merge. Let
us look at an example::

    A - B - C - D - E --- G - H
    |                    /
    '-------- F --------'

with::

    stablesort(::H) == [A, B, C, D, E, F, G, H]

then::

    stdslice(H-0) = [A, B, C, D] + [E, F, G, H] = D-0 + H-4

As a result the non-linearity will increase the number of subranges involved,
but in practice the impact stays limited.

The total number of standard subranges stays under control, with about
`O(log2(N))` new stable ranges introduced for each new revision.

In addition, it is worth noting that the heads of the extra ranges we have to
use will match the destinations of the "jumps" cached by the stablesort
algorithm. So all this slicing can usually be done without iterating over the
stable sorted revisions.

Caching Strategy
----------------

The current caching strategy uses a very space-inefficient sqlite database.
Testing shows it often takes 100x more space than basic binary storage would.
The sqlite storage was very useful at the proof-of-concept stage.

All new stable ranges introduced by a revision R are "R headed", so we could
easily store their standard subranges alongside the revision information to
reuse the existing revision index.

Subrange information can be stored efficiently: a naive approach storing all
stable ranges and their subranges would require just 2 integers per range + 2
integers for any extra subrange other than the first and the last ones.

We could probably push efficiency further by taking advantage of the large
overlap in subranges from one non-merge revision to the next, but that is
premature optimisation until we get actual results for a naive binary storage.

To use this at a large scale, it would be important to compute this data at
commit time and to exchange it alongside the revisions over the network. This
is similar to what we do for other cached data.

Performance
-----------

The current implementation has not been especially optimized for performance.
The goal was mostly to get the order of magnitude of the algorithm's
complexity. The results are convincing: a medium repository gets a full cache
warming in a couple of seconds, and even very large and branchy repositories
get fully warmed within tens of minutes. We do not observe a complexity
explosion that would make the algorithm unusable on large repositories.

A better cache implementation combined with an optimized version of the
algorithm should give much faster performance. Combined with commit-time
computation and exchange over the network, the overall impact of this should
be invisible to the user.
"""

import abc
import functools