Lost or corrupted undo-tree history

6 January 2020 Undo-tree is the most popular of the Emacs packages I wrote and maintain,1 1Though not the largest or most sophisticated. That credit certainly goes to my predictive package and accompanying data structure and algorithm libraries. not least because it's used by Evil and Spacemacs. However, for many years, the internet has been awash2 2OK, maybe not awash. More of a small rivulet running into the little Emacs backwater of the internet. with reports of undo history being lost or corrupted when using undo-tree-mode. Trouble was, though I do all my text editing in Emacs, and always have global-undo-tree-mode enabled, I'd never encountered these issues myself. And no one had ever been able to send me a recipe for reliably reproducing them. Making it fiendishly difficult to even begin to investigate what might be going on.

This post is about some recent changes to undo-tree that might finally address these issues.

Lost undo history: the dreaded "no further undo information" message

Both Emacs' vanilla undo system and the undo-tree system print this message when there are no further changes to be undone in the buffer. Undo-tree users had been reporting for years that they were seeing this message, and weren't able to undo any further, even though they hadn't undone all the way back to the very first change they'd made to the buffer.

Finally, last year, user and Spacemacs contributor duianto very helpfully posted on the Spacemacs github issue tracker a simple recipe to reproduce this issue. Finally having a reliable way to reproduce this issue, it was then easy to investigate what was going on. And it turned out that …

undo-tree-mode was working completely correctly, and behaving exactly as documented.

When the size of the undo history recorded by Emacs grows larger than the limit set by the undo-limit configuration variable3 3In fact, there are three different configuration variables controlling when Emacs discards undo history: undo-limit, undo-strong-limit and undo-outer-limit. But the details aren't so important here. See the relevant section of the Emacs manual for the full story. , Emacs discards the oldest changes recorded in the undo history, to bring the history size back below this limit. Undo-tree implements exactly the same behaviour, discarding the oldest changes when the size of the undo tree grows bigger than undo-limit. duianto's recipe generated enough changes for the undo history stored in the undo tree to grow larger than undo-limit. So undo-tree duly discarded the oldest changes.

One response to this would be to yell RTFM at everyone.

But I think a much more useful and instructive response is to consider why users thought this behaviour was a bug in undo-tree. Whereas (as far as I'm aware) no one reported this issue with vanilla Emacs undo, despite it discarding undo history in exactly the same way.

I can think of a few explanations:

  1. Undo-tree gave no indication to the user when undo history was being discarded. It just silently discarded the oldest history when it grew beyond undo-limit. So users weren't expecting undo history to have been discarded.

This alone doesn't explain it, as Emacs' vanilla undo system gives no such indication, either. But…

  1. Because of the way undo-tree-mode stores undo history as a tree of changes, it typically takes far fewer undo operations before you run out of changes to undo. If you repeatedly undo, you run out of changes as soon as you reach the top of the tree. Other lines of undo history are stored in other branches of the tree, reached by switching to a different undo branch rather than repeatedly undoing.

    Whereas with vanilla Emacs' linear undo history, you have to wind all the way back through the entire undo history, undoing changes, then undoing previous undos, then undoing the undoing of the undos, and so on, before you reach the beginning of the undo history and run out of things to undo.4 4If you want to understand this better, the Commentary at the front of the undo-tree.el package has an extensive description (complete with retro ASCII-art diagrams!) of how the two undo systems work.

  2. With vanilla Emacs undo, it's hard to know where you've got to in the undo history, or how far back you expect to reach. The very act of undoing creates more undo history. And as soon as you break a sequence of undos with any other command (even moving the point), you start right back from the beginning, undoing your recent undos (which is how vanilla Emacs implements redo operations).

    By making the undo history conceptually simpler to navigate, and even displaying the whole undo-tree visually, undo-tree-mode makes it much easier to know where you are in the undo history, where you want to get to, and therefore much more noticable if undo history has been discarded.

Viewed in this light, these non-bug reports are an indication that undo-tree is doing a good job of making Emacs undo more usable, but a bad job of meeting users' expectations. Apparently users expect undo history to be preserved indefinitely – not unreasonable given modern computing resources. So they perceive it to be a bug if any undo history is lost. And undo-tree makes it that much easier to notice that some undo history has been lost, compared to vanilla Emacs undo.

If software isn't matching user's expectations, there may not be a bug in the code. But in my view, there is still a bug: in the user experience. These possible explanations for the non-bug reports suggest two ways to address this in undo-tree: reduce user's expectations, or change undo-tree's behaviour to match user's expectations. In the latest undo-tree release, I've tried to do both.

Emacs famously solved the complaints about garbage-collection freezes due to its stop-the-world garbage collector by simply disabling the message telling users garbage collection was happening. The undo-tree history discarding issue seems to call for the opposite solution. Previously, undo-tree (like vanilla Emacs) didn't warn users when undo history was discarded due to breaching the undo-limit.5 5Except when undo-outer-limit is breached, when both undo systems display a prominent warning that even the last buffer modification cannot be undone. In the new release, undo-tree displays a message in the echo area when undo history is discarded, and points users to the undo-limit configuration variable.

At the same time, users clearly expect more undo history to be preserved when using undo-tree-mode. The Emacs default for undo-limit is 80kB, which is very conservative. It's a very safe bet that any user running Emacs with undo-tree-mode these days has a lot more spare RAM than this. So the new version of undo-tree increases the default undo history size limit to 80MB when running undo-tree-mode. More precisely, there are three new configuration variables, undo-tree-limit, undo-tree-strong-limit and undo-tree-outer-limit, all with larger default values than their vanilla Emacs counterparts. undo-tree-mode sets the undo history size limit to whichever of undo-limit or undo-tree-limit is the larger (and similarly for the other configuration variables).

Finally, if you really don't want undo-tree to ever discard any undo history, no matter how large it grows, setting undo-tree-limit to nil will now make undo-tree preserve undo history indefinitely.6 6As far as possible. Undo-history-discarding is hard-wired into Emacs' garbage collector (see later in this post). So it's not 100% possible to prevent Emacs from discarding undo history behind undo-tree's back in certain circumstances. But setting undo-tree-limit to nil makes those circumstances rather (British understatement warning!) unlikely: you would need to generate more than 2 exa-bytes of undo history, without ever leaving Emacs idle for more than 5 seconds whilst doing so. I'm not sure this is a wise idea, particularly if coupled with the popular undo-tree-auto-save-history setting which persistently saves undo-tree history to file. Over time, undo history could easily grow large enough to become very slow to load and save, or even use. So this is not the default setting. But the option's there now for those who want to live on the edge.

Corrupted undo-tree history

The undo-history-discarding issue is only half the story, though. There has also been a steady trickle of reports of corrupted undo history, resulting in either errors or (worse still) garbled buffer text when undoing. Unfortunately, I still have yet to be sent a recipe for reliably reproducing this issue, and I can't reproduce it myself. So I'm still shooting in the dark on this one. However, I have a sneaking suspicion a race condition with the Emacs garbage collector might be responsible…

To understand why, you need to understand a little of how Emacs' undo system works behind the scenes, and how undo-tree manages to replace this at the Elisp level, without changing any Emacs c code. Whenever you make any modification to the text in a buffer, Emacs pushes a description of those changes onto the front of the list stored in the buffer-local buffer-undo-list variable. This variable gets special treatment during garbage collection. Whenever GC runs, it checks if the size of buffer-undo-list exceeds undo-limit (or one of the other undo size limits), and deletes enough older undo history from the tail of buffer-undo-list to bring it back within the configured size limits.

Much of this mechanism is hard-coded deep down in Emacs c code, and runs outside the context of the Elisp interpreter. Undo-tree stays well away from this mechanism, and doesn't mess with it at all. In undo-tree-mode, undo history still accumulates in buffer-undo-list. But whenever you call any undo-tree command (undo-tree-undo, undo-tree-redo, undo-tree-visualize etc.), the first thing it does is to transfer any undo history that has accumulated in buffer-undo-list since the last time, into the undo-tree package's own buffer-undo-tree data structure. In this way, undo-tree doesn't have to reimplement any of the code for recording undo history (which would anyway be very difficult to do from pure Elisp code). Instead, it can rely on Emacs' built-in, deeply embedded, thoroughly road-tested implementation of this.

However, this design means undo history continues to accumulate in buffer-undo-list until the next undo-tree command is called. If the user doesn't run any undo-tree commands for a long period, buffer-undo-list can easily grow larger than undo-limit and get garbage collected before any undo-tree command is called, i.e. before undo-tree even has a chance to see it.7 7You might wonder why undo-tree doesn't set up a timer to regularly transfer undo history from buffer-undo-list to buffer-undo-tree, or maybe an after-change-functions hook, or similar, to reduce the chances of this happening. But we'll see shortly that there's no benefit whatsoever to doing that. Indeed, it would be less reliable. It's critical there are no gaps in the undo history; it must contain a continuous, unbroked chain of changes. Otherwise, undoing across a gap in the history would certainly lead to corrupted buffer text at best, errors at worse.

With older undo history still stored in buffer-undo-tree, somewhat more recent history deleted before undo-tree saw it, and the newest history still stored in buffer-undo-list, doesn't this mean undo-tree will inevitably end up corrupting the undo history? Is this the source of the bugs? No! It would be a very poor design if that were the case. The potential GC race condition is more subtle.

Let's think through the scenario where Emacs garbage collects undo history exceeding undo-limit before undo-tree gets to see it. Emacs only ever garbage collects just enough undo history to bring it back within the limit, in order to preserve as much undo history as possible. So the undo data remaining in buffer-undo-list after GC is as much as Emacs is allowed to store, according to the current undo-limit setting. Therefore, the correct thing for undo-tree to do in this scenario is to discard the contents of buffer-undo-tree entirely, and build a fresh undo tree from scrach using just the contents of buffer-undo-list. This freshly built tree will already contain as much undo history as undo-tree is allowed to store, according to the configured undo-limit.

All that's required is some mechanism to detect that Emacs has garbage collected undo history since undo-tree last saw the contents of buffer-undo-list. We could try to check the size of buffer-undo-list against undo-limit (and the other settings). But this would be very unreliable. Emacs only discards undo history in chunks ("undo changesets" in Emacs terminology); it doesn't discard individual changes. A very large changeset that just barely pushes buffer-undo-list over undo-limit would, when discarded, leave a buffer-undo-list that's much smaller than the limit. But the same undo history could equally well have been produced by a sequence of small changes that never pushed buffer-undo-list over undo-limit at all.

Instead, undo tree adds a special indicator entry to the very end of buffer-undo-list. When undo tree transfers undo history from buffer-undo-list to buffer-undo-tree, it checks whether this special indicator is still present or not.8 8Like miners of yore who sometimes took a sacrificial caged bird with them down coal mines, to serve as a signal of dangerous build-ups of toxic gas: if the canary died, it was time to get out of the mine as fast as possible. Undo-tree uses the symbol 'undo-tree-canary for the special indicator entry at the end of buffer-undo-list. If the undo-tree-canary is no longer there, it means there was a dangerous build-up of undo history. Since garbage collection deletes the oldest undo history from the end of buffer-undo-list first, the first thing that gets deleted is this special indicator entry. If it's missing, it means Emacs has discarded some undo history since undo-tree last saw it, and a fresh buffer-undo-tree should be built from the remaining contents of buffer-undo-list.

The undo-tree package's undo-list-transfer-to-tree function is the only interface between Emacs' low-level undo history recording mechanisms, and undo-tree's system. This is quite a simple function: all it does is pop entries off buffer-undo-list, and add them to buffer-undo-tree. It doesn't manipulate the undo data itself in any way.9 9This isn't quite true. Undo-tree does something special with marker motion entries for other reasons. But it seems unlikely this has anything to do with the reported bugs and errors. And when it comes to actually performing undo operations, undo-tree doesn't even do these itself! Rather, it passes this same undo data back to Emacs' own primitive-undo function – the very same function that Emacs' vanilla undo system uses. So there's very little scope for corrupting the undo data, and very little code footprint where this could occur. The simple undo-list-transfer-to-tree function is the only plausible candidate.

But Emacs' GC runs outside the Elisp interpreter. It can be triggered at any point whilst running Elisp code, between pretty much any pair of Elisp instructions. What if GC happens to run whilst undo-list-transfer-to-tree is in the middle of popping undo entries off buffer-undo-list? As long as GC deletes part of buffer-undo-list that comes after the entry undo-list-transfer-to-tree is currently processing, when it resumes after GC undo-list-transfer-to-tree will just notice the missing indicator when it gets to the end of buffer-undo-list, and build a fresh buffer-undo-tree from the undo history it's just processed. But if GC happens to delete part of the changeset undo-list-transfer-to-tree is in the middle of processing, it could end up transferring a partial undo changeset. That could potentially lead to corrupted undo history.

Worse still is if GC deletes part of buffer-undo-list starting from a point before the entry undo-list-transfer-to-tree has already reached. Emacs' mark-and-sweep GC shouldn't delete any elements reachable from Elisp, which in this scenario would include the whole of buffer-undo-list. But that variable is a glaring exception. Its contents are always reachable from Elisp, via buffer-undo-list itself. Yet Emacs needs to delete part of it during GC, nonetheless. So the Emacs garbage collector takes a blunt approach: cons cells get unlinked from buffer-undo-list by GC, regardless of whether something else is holding a reference to them.10 10You can test this for yourself in Emacs. Just do M-: (setq gc-test buffer-undo-list) to set the gc-test variable to the value of buffer-undo-list. Then do M-: (length test) and note down the length of the list. Now make a large enough number of buffer modifications to push the size of buffer-undo-list above undo-limit. (You can temporarily set undo-limit to a small value to reach this limit faster.) Then do M-: (length test) again. You'll see that the list in gc-test has been truncated, even though those cons cells are manifestly reachable from somewhere other than buffer-undo-list – namely, gc-test itself. In fact, it seems to be even cruder still. GC treats any Elisp variable that happens to be called buffer-undo-list in the same fashion, even if it isn't the buffer-undo-list but is, say, let-bound. In this scenario, when it resumes, undo-list-transfer-to-tree will continue processing cons cells that have been unlinked from buffer-undo-list and the behaviour is undefined, or at least undocumented.11 11The only way to know for sure would be to trace through the Emacs GC code, or step through it in gdb. Which I have neither the time nor the expertise for. It would anyway be implementation-dependent, and subject to change in future Emacs releases.

Like many race conditions, this one is extremely hard to test. It requires GC to be triggered at just the right point. Moreover, GC runs outside the context of the Elisp interpreter. So from within Elisp or the Elisp debugger, you only get to see the state of buffer-undo-list and undo-list-transfer-to-tree before and after GC, but not during. However, when testing duianto's recipe for reproducing the undo-tree history discarding issue discussed above, I sporadically saw some odd behaviour that didn't make sense from Elisp, and which I couldn't reliably reproduce. And other comments on the same Spacemacs git issue tracker circumstantially hint at GC being implicated. In any case, whether it's responsible for the trickle of undo-tree corruption bug reports or not, this potential GC race condition is still something that needs addressing.

In the latest undo-tree release, I've implemented multiple measures to reduce the likelihood of – and eliminate the effects of – this race condition:

  1. undo-list-transfer-to-tree now manually triggers GC by calling the garbage-collect function, just before it starts processing buffer-undo-list.12 12Actually, Emacs GC is so slow, triggering it every time made undo-tree's undo and redo commands annoying slow and laggy. So in fact undo-list-transfer-to-tree first checks if buffer-undo-list contains any entries, and only calls GC if it does. Undo-tree undo and redo operations do not add any additional entries to buffer-undo-list. So in a sequence of multiple undos or redos – a very common occurrence in practice – only the first one will trigger GC and create a slight pause, which is much more tolerable. This should significantly reduce the likelihood of GC occurring again whilst in the middle of processing buffer-undo-list.
  2. Previously, undo-list-transfer-to-tree popped undo entries one by one from buffer-undo-list, and built new sections of the buffer-undo-tree data structure as it went along. This gave a relatively large time window when GC could be triggered whilst still in the middle of processing buffer-undo-list. In the new version, undo-list-transfer-to-tree first deep-copies the entire contents of buffer-undo-list to a new variable, then processes the copy. In this way, GC could only affect things if it occurs whilst copying, which should be a singificantly shorter time window.
  3. Finally, undo-tree now installs a function in the post-gc-hook which sets an undo-tree-gc-flag variable to t whenever GC runs. undo-list-transfer-to-tree resets this flag to nil just before deep-copying buffer-undo-list, and checks if it's still nil afterwards. If not, it keeps retrying, until it succeeds in copying buffer-undo-list without GC being triggered during copying.

The first two measures should significantly reduce the chance of hitting this GC race condition. The third should elliminate any possibility of bad side-effects, in the unlikely event GC does still run at a bad time.

As it's extremely difficult to reliably reproduce the race condition, let alone be sure if this is what's behind the unreproducible bug reports, it's very hard to test how successful these measures will be before the release. On the other hand, it's also possible that many or all of the bug reports related to the undo history discarding non-bug discussed above, which measures discussed there should alleviate.

The new release is now out on both GNU Elpa and here. Time will tell whether these measures stem the tide (or rather, divert the rivulet).

Comments

Alexander

9 January 2020 Thanks for a cool writeup. The fix was long-awaited by many. Having learned now about the potential cause of corruption thanks to this blog post, I wanted to ask you whether you considered to simply wrap the body of `undo-list-transfer-to-tree' into the following `let' form:

(let ((gc-cons-threshold most-positive-fixnum)) ...)

Toby Cubitt

9 January 2020 @Alexander #1:

Thanks for the kind comment. The undo-limit non-bug should definitely be "fixed" now, but let's see if the gc-related changes really fix the corruption issues… I still don't have a way of reproducing the issues myself, so I'm only guessing gc is implicated here.

Your gc-cons-threshold suggestion is a good one, and much simpler than what I've implemented. I did consider doing something like that, once I suspected a GC race.

However, increasing gc-cons-threshold strictly speaking just masks the GC race issue; it doesn't fix it properly. If you were really, really unlucky, the GC limit could still be breached during copying, and GC could still end up being triggered at a bad time. Still, it's true that letting gc-cons-threshold to most-positive-fixnum would completely disable GC in practice. You'd need to cons multiple exa bytes of data before GC would get triggered!

But the fix I've implemented should make undo-list-transfer-to-tree completely immune to GC races, regardless of when GC is triggered. Since it's possible to implement a full fix for this, it seems like the right thing to do. Also, I'm slightly wary of (effectively) temporarily disabling GC. The undo-tree data structures are very large by Elisp standards (cf. https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27779). On resource-constrained machines, it's conceivable that disabling GC whilst copying around large amounts of undo-tree data might blow the Emacs c stack limit.

It might be worth adding the temporary gc-cons-threshold increase in addition to the other measures, as a belt-and-braces approach. But let's see how things go. If people still report undo-tree corruption issues despite the measures I've implemented, it would likely mean I'm wrong about GC race conditions being the cause in the first place.

Damien Cassou

17 January 2020 Thank you so much for your package (that I've been using for a decade without any trouble and with much satisfaction) and for this very informative blog post. The only thing I have sometimes missed is undoing restricted to the region.

Thank you again. Amazing package!

Toby Cubitt

17 January 2020 @Damien Cassou #3: Very glad to hear you find it so useful!

Conceptually, there's no problem making undo-in-region work with undo-tree. Unlike undoing and redoing, which just move to previously-existing buffer states, undo-in-region creates new buffer states that never existed before. This just implies creating new branches of the undo-tree to represent these buffer states.

This is exactly what the undo-in-region implementation in the undo-tree package attempts to do. But the current implementation still needs polishing, so it's disabled by default. You can enable it at your own risk by setting undo-tree-enable-undo-in-region, but expect to hit some bugs. (Please report them if you can find a recipe for reproducing them!) As always, the problem with undo bugs is that undo is highly stateful, so reproducing a bug again is typically very challenging.

Andrew

23 January 2019 Thanks for the interesting read! I've tried to test the new version of undo-tree, and seems that unfortunately the undo-tree-auto-save-history functionality was broken - I keep getting the following error when saving the buffer:

Debugger entered--Lisp error: (wrong-type-argument hash-table-p nil)
  puthash(undo-tree-id0 # nil)
  undo-tree-move-GC-elts-to-pool((# . -1))
  #f(compiled-function () #)()
  apply(#f(compiled-function () #) nil)
  undo-list-transfer-to-tree()
  undo-tree-save-history(nil t)
  undo-tree-save-history-hook()
  run-hook-with-args-until-success(undo-tree-save-history-hook)
  basic-save-buffer(t)
  save-buffer(1)
  funcall-interactively(save-buffer 1)
  call-interactively(save-buffer nil nil)
  command-execute(save-buffer)

There's also other reports of this error over the internet (e.g. on reddit). Do you mind looking into that?

Toby Cubitt

26 January 2020 @Andrew #6: You don't say which vesion of undo-tree you experienced this on. Exactly this error was reported already (see comments at undo-tree.html), and I think is already fixed in the latest version (now at 0.7.3). Try updating and let me know how it goes.

Andrew

27 January 2020 Turns out I had 0.7.2 version, and the issue is indeed fixed in 0.7.3. Thanks a lot!

Leave a comment

All comments are moderated. Clicking submit will open your email client and let you send your comment by email. By submitting your comment you agree to license the content under a Creative Commons Attribution-ShareAlike 4.0 International License.




Creative Commons License