Here is the latest from Julia Computing
BL
G

FemtoCleaner - A bot to automatically upgrade your Julia syntax

17 Aug 2017 | Keno Fischer

TL;DR: FemtoCleaner is a GitHub bot that upgrades old Julia syntax to new syntax. It has been installed in more than 700 repositories, submitted 100+ pull requests and touched 10000 lines of code since last Friday. Scroll down for instructions, screen shots and pretty plots.

Background

As julia is approaching its 1.0 release, we have been revisiting several key areas of the language. We want to ensure that the 1.0 release is of sufficient quality that it can serve as a stable foundation of the Julia ecosystem for many years to come without requiring breaking changes. In effect, however, prioritizing such breaking changes over ones that can be safely done in a non-breaking fashion after 1.0 means that we are currently making many more breaking changes than we otherwise might. Two particularly disruptive such changes were the syntax changes to type keywords and parametric function syntax, both of which were introduced in 0.6. The old syntax is now deprecated on master will be removed in 1.0. The former change involves changing the type definition keywords from type/immutable to mutable struct/struct, e.g.

immutable RGB{T}
    r::T
    g::T
    b::T
end

becomes

struct RGB{T}
    r::T
    g::T
    b::T
end

The parametric function syntax change is a bit more tricky. In the simplest case, it involves rewriting functions like:

eltype{T}(::Array{T}) = T

to

eltype(::Array{T}) where {T} = T

which is relatively straightforward. However, there are more complicated corner cases involving inner constructors such as:

immutable Wrapper{T}
    data::Vector{T}
    Wrapper{S}(data::Vector{S}) = new(convert(Vector{T}, data))
end

which now has to read

struct Wrapper{T}
    data::Vector{T}
    Wrapper{T}(data::Vector{S}) where {T,S} = new(convert(Vector{T}, data))
end

This last example also shows why this syntax was changed. In prior versions of julia, the braces syntax (F{T} for some F,T) was inconsistent between meaning parameter application and introducing parameters for a method. Julia 0.6 features a significantly more powerful (and correct) type system. At the same time, the F{T} syntax was changed to always mean parameter application (modulo support for parsing the deprecated syntax for backwards compatibility of course), reducing confusion and making it possible to more easily express some of the new method signatures now made possible by the new type system improvements. For further information see also Stefan’s Discourse post and the 0.6 release notes.

Realizing the magnitude of the required change and the growing amount of Julia code that exists in the wild, several julia contributors suggested on Discourse that we attempt to automate these syntax upgrades. Unfortunately, is not simply a of search/replace in a source file. The rewrite can be quite complex and depends on the scope in which it is used. Nevertheless, we set out to build such an automated system, with the following goals in mind:

  • Correctness - Being able to upgrade syntax is not very useful if we have to go in and clean up after the automated process’ mistakes, it probably would have been faster to just do it ourselves in the first place.
  • Style preservation - Many programmers carefully write their code in their own preferred style and we should try hard to preserve such choices whenever possible (otherwise people might not want to use the tool)
  • Convenience - Ideally no setup would be required to use the tool

CSTParser

The first goal, correctness, forces us to use a proper parser for our undertaking, rather than relying on simple find/replace or regular expressions. Unfortunately, while julia’s parser is accessible from within the language and can be used to find these instances of deprecated syntax, it cannot be used for our purposes. This is because it does not support our second goal - style preservation. In going from the input text to the Abstract Syntax Tree, the parser discards a significant amount of semantically-irrelevant information (formatting, comments, distinctions between different syntax forms that are otherwise semantically equivalent). Instead, we need a parser that retains and exposes all of this information. There are several names of this concept, “round-tripable representation”, “Concrete Syntax Tree (CST)” or “Lossless Syntax Tree” being perhaps the most common. Luckily, in the Julia ecosystem we have not one, but two choices for such a parser:

  • JuliaParser.jl - a slightly older translation of the scheme parser from the main julia codebase into Julia, later retrofitted with precise location information.
  • CSTParser.jl - a ground up rewrite of the parser with the explicit goal of writing a high performance, correct, lossless parser, originally for use in the VS Code IDE extension

Ultimately the decision came down to the fact that CSTParser.jl was actively maintained, while JuliaParser.jl had not yet been updated to the new Julia syntax. With a number of small enhancements and additional features I contributed in order to make it useful for this project, CSTParser is now able to parse essentially all publicly available Julia code correctly, while retaining the needed formatting information.

The design of CSTParser.jl is somewhat similar to that of the Roslyn parser (a good overview can be found here). Each leaf node in the AST stores only its total size (but not its absolute position in the file), as well as what part of its contents are semantically significant as opposed to leading or trailing trivia (comments, whitespace, semicolons etc). This is useful for the IDE use case, since it allows efficient reparsing when small changes are made to a file (since a local change does not invalidate any data in a far away node). The resulting tree can be a little awkward to work with, but as we shall see it is easy to work around this for our use case.

Deprecations.jl

The new Deprecations.jl package is the heart of this project. It contains all the logic to rewrite Julia code making use of deprecated syntax constructs. It supports two modes of specifying such rewrites:

  • Using CST matcher templates
  • By working with the raw CST api Independent of the mode, a new rewrite is introduced as such:
      struct OldStructSyntax; end
      register(OldStructSyntax, Deprecation(
          "The type-definition keywords (type, immutable, abstract) where changed in Julia 0.6",
          "julia",
          v"0.6.0",
          v"0.7.0-DEV.198",
          typemax(VersionNumber)
      ))
    

    which gives a description, as well as some version bounds. This is important because we need to make sure to only apply rewrites that are compatible with the package’s declared minimum supported julia version (i.e. we need to make sure not to introduce julia 0.6 syntax to a package that still supports julia 0.5). Each Julia package provides a REQUIRE file specifying it’s supported minimum versions.

Having declared the new rewrite, let’s actually make it work by adding some CST matcher templates to it:

    match(OldStructSyntax,
        "immutable \$name\n\$BODY...\nend",
        "struct\$name\n\$BODY!...\nend",
        format_paramlist
    )
    match(OldStructSyntax,
        "type \$name\n\$BODY...\nend",
        "mutable struct\$name\n\$BODY!...\nend",
        format_paramlist
    )

The way this works is fairly straightforward. For each match call, the first line is the template to match and the second is its replacement. Under the hood, this works by parsing both expressions, pattern matching the resulting template tree against the tree we want to update and then splicing in the replacement tree (with the appropriate parameters taken from the tree we’re matching against). The whole thing is implemented in about 200 lines of code.

In this description I’ve skipped a bit of magic. Simply splicing together a new tree of CST nodes, doesn’t quite work. As mentioned above the CST nodes only know their kind and size and very little else. In particular, they know neither their position in the original buffer, nor what text is at that position. Instead, the replacement tree is made out of different kind of node that retains both pieces of information (which the original buffer is and where in the buffer that node is located). Conceptually this is again similar to Roslyn’s red-green trees. However, there is very little code associated with this abstraction. Most of the functionality is provided by the AbstractTrees.jl package by lifting the tree structure of the underlying CST nodes.

Lastly, there’s a couple of other node types to be found in this “replacement tree” to insert or replace whitespace or other trivia. This is useful for formatting purposes. E.g. the example above, we passed format_paramlist as a formatter. This function runs after the matching and adjusts formatting. To see this consider:

immutable RGB{RT,
              GT,
              BT}
    r::RT
    g::GT
    b::BT
end

Naively, this would end up as

struct RGB{RT,
              GT,
              BT}
    r::RT
    g::GT
    b::BT
end

leaving us with unhappy users. Instead, the formatter turns this into

struct RGB{RT,
           GT,
           BT}
    r::RT
    g::GT
    b::BT
end

by adjusting the leading trivia of the GT and BT nodes (or rather the trailing trivia of their predecessors).

Lastly, while the CST templates shown above are powerful, they are still limited to simple pattern matching. Sometimes we need to perform more complicated kinds of transformation to decide which rewrites to perform. One example is code like:

if VERSION > v"0.5"
    do_this()
else
    do_that()
end

which, depending on the current julia version, executes either one branch or the other. Of course, if the package declares that it requires julia 0.6 at a minimum, the condition is true for any supported julia version, so we can “constant fold” the expression and remove the else branch. Doing so with simple templates is infeasible, since we need to recognize all patterns of the form “comparison of VERSION against some version number” and then compute whether the condition is actually always true (or false) given the declared version bounds. Such transformations are possible using the raw API. Writing such transformations is more complicated (and beyond the scope of this post), but can be very powerful.

FemtoCleaner

Having addressed the first two goals, let’s get to the third goal - convenience. The vast majority of public Julia code is hosted on GitHub, so the natural way to do this is create a GitHub bot that clones a repository, applies the rewrites and submits a pull request to the relevant repository. The simplest way to do would be to clone all the repositories, apply the rewrites, and then programmatically submit pull requests to all of them (the PkgDev.jl packages has a function to automatically submit a pull request against a Julia package). However, this approach falls short for several reasons:

  • It’s very manual. When new features are added, we have to manually perform a new such run. This is also problematic, because in practice it means that these runs have to always be done by the person who knows how the setup works. He’s a very busy guy.
  • It would only catch registered Julia packages. There are a significant number of repositories that use Julia code, but are not registered Julia packages. Of course one could go the other way and submit pull requests to repositories that look like Julia code, but that risks creating a significant number of useless pull request (because of forks, long dead codebases, etc)
  • It wouldn’t work on private packages
  • It doesn’t allow the user to control and interact with the process

A better alternative that addresses all these problems is to create a GitHub bot (also called a GitHub app) to perform these functions. The Julia community is quite familiar with these already. We have the venerable nanosoldier, which performs on-demand performance benchmarking of julia commits, attobot which assists Julia users in registering their packages with METADTA and (perhaps less well known) jlbuild which controls the julia buildbots (which build releases and perform some additional continous integration on secondary platforms).

Joining these now is femtocleaner (phew that took a while to get to - I hope the background above was useful though), which performs exactly this function. Let’s see how it works. First go to https://github.com/apps/femtocleaner and click “Configure”. You’ll be presented with a choice of accounts to install femtocleaner into:

Choosing an account will give you the option to install femtocleaner on either one or all of the repositories in that account:

In this case, I will install femtocleaner in all repositories of the JuliaParallel organization. Without any further ado, femtocleaner will go to work, cloning each repository, applying the rewrites it knows about and then submitting a pull request to each repository where it was able to make a change:

From now on, FemtoCleaner will listen to events on these repositories and submit another pull request whenever these packages decide to drop support for an older julia version, thus allowing the bot remove more deprecated syntax. The bot can also be triggered manually by opening an issue with the title “Run femtocleaner”.

The bot has a few additional features meant to make interacting with it easier. The most used one is the bad bot command, which is used to inform the developers that the bot has made a mistake. It can be triggered by simply creating a “Changes Requested” GitHub PR review, and annotating an incorrect change with the review comment bad bot, like so:

In response the bot will open an issue on its source code repository giving the relevant context and linking back to the PR:

Enabling this functionality right from the the pull request review window has proven very powerful. Rather than requiring the user to leave their current context (reviewing a pull request) and navigate to a different repository to file an issue, everything can be done right there in the pull request review. Lastly, once the rewrite bug has been addressed, the bot will come back, update the pull request and leave a comment to inform the user it did so:

This workflow is also very convenient from the other side. All the issues are in one place (rather than having to monitor activity on all pull requests filed by the bug) and addressing the bug is as simple as fixing the code and pushing it to the source code repository. The bot will automatically, update itself and go back and fix up any pull requests that would now differ as a result of the new code:

Results

The whole project from the first line of code written in support of it until this blog post (which represents its completion) took about three weeks. As part of it, I made a number of changes to CSTParser and its dependencies (which should prove very useful for future parsing endeavors) as well as GitHub.jl (which will hopefully help write more of these kinds of bots to support the Julia community). After some initial testing and an alpha run on JuliaStats on Aug 8 (huge thanks to Alex Arslan for aggreeing to diligently review and try out the process), we announced the public availability of femtocleaner on discourse last friday (Aug 11). Since then, the bot has been installed on 759 repositories (though about 200 of them were ineligible for femtocleaner processing, either because they had missing or malformed REQUIRE files or because they were not actually Julia packages), submitting 132 pull requests that add 8850 lines and delete 9331. Most of these pull requests have been merged:

As people started using femtocleaner, a number of issues were discovered, but developers took advantage of the bad bot mechanism described above to report them and we did our best to address them quickly. The following graph shows the number of open/closed such issues over the time period that femtocleaner has been active:

Alex Arslan’s original testing on Aug 8 is well visible (and took a few days to catch up to), but all known issues have been addressed. Another interesting data point is the distribution of supported julia versions that femtocleaner was installed on. As discussed above, it was primarily written to aid in moving to the new syntax available in julia 0.6, though a few rewrites (such as the generic VERSION comparisons) are also applicable to older versions. The following shows the number of repositories as well as the number of open prs by minimum supported julia version (no pr opened means that the bot found no deprecated syntax):

As expected, packages supporting 0.6.0 got proportionally the most pull requests. However, this just means that femtocleaner will be back for the remaining 0.5.0 packages once they decide to drop support for 0.5.0. We can also look at the number of changed lines by the package’s supported minimum version:

Again the bias of the bot for upgrading 0.6 syntax stands out. It is perhaps interesting to note that most of the 0.6 packages with a small to medium number of changes had already been upgraded manually to the new syntax. Still, the bot was able to find a few changes that were missed in this process and clean them up automatically.

Conclusions

Overall, this work should accelerate the movement of the package ecosystem towards 1.0 by making upgrading code easier. Generally, the package ecosystem lags behind the julia release by a few months as package maintainers upgrade their code bases. We hope this system will help make sure that 1.0 is released with a full set of up-to-date packages, as well as ease the burden on package maintainers, allowing them to spend their time on improving their packages instead of being forced to spend a lot of time performing tedious syntax upgrades. We are very happy with the outcome of this work. There are already almost ten thousand fewer deprecation warnings across the Julia ecosystem and more will be removed automatically once the package developers are ready for it. Additionally, the underlying technology should help with a number of other developer-productivity tools and improvements, such as IDE support, better error messages and the debugger. All code is open source and available on GitHub. You are welcome to contribute, improve the code or build your own GitHub bots.

We would like to thank GitHub for providing a rich enough API to allow this convenient workflow.

Lastly, we thank and acknowledge the Sloan foundation for their continued supported of the Julia ecosystem by providing the funding for this work.

Recent posts

Analyzing LendingClub data using JuliaDB
22 Aug 2017 | Venkat Subramaniam and Shashi Gowda
Algorithmic trading with Julia
22 Aug 2017 | Avik Sengupta and Simon Byrne
FemtoCleaner - A bot to automatically upgrade your Julia syntax
17 Aug 2017 | Keno Fischer
Get the latest news about Julia delivered to your inbox.
Need help with Julia?
We provide products, training and consulting to make Julia easy to use, easy to deploy and easy to scale in your organization. Email us: [email protected]
Contact us
Julia Computing, Inc. was founded with a mission to make Julia easy to use, easy to deploy and easy to scale. We operate out of Boston, New York, London, Bangalore, San Francisco, Los Angeles and Washington DC and we serve customers worldwide.
© 2015-2017 Julia Computing, Inc. All Rights Reserved.