Extending TreeScanner

If you want to analyze Java source code, you can get access to them by first parsing the .java files and then extending the TreeScanner class to walk the parse tree. TreeScanner is a class in com.sun.source.util and as this is an exported package, you can work with the public classes in it without problems.

Extending TreeScanner allows you to provide override methods for the scan method, the reduce method and all the visit methods, and that is exactly what you need to retrieve useful information from Java’s abstract syntax trees.

Traversal of the Abstract Syntax Tree

In the previous blog post we saw how the parse tree is being traversed. The flow oscillates between TreeScanner and the static subclasses of JCTree, summarized in the line written below, whereby scan and visit are found in TreeScanner and accept and get in the JCTree subclass:

scan(Tree) -> accept(Visitor) -> visit(Tree) -> node.get() -> ...

In Java’s implementation, the scan method has two overloads, one being used if the first argument is of type Tree and one being used if the first argument is of type Iterable<? extends Tree>. This overload is required because the typical ‘get’ methods of the different Tree types either return a single Tree object or an iterable of Tree objects.

Furthermore there is a scanAndReduce method, which also exist in two variants, one for a single Tree object and one for an iterable of Tree objects. The scanAndReduce is being introduced because the Java implementation allows for the generation of a return value of scan that aggragates the return values of all trees being visited. This scanAndReduce method is a wrapper around the scan method and has the reduce method in it as well. I will get back to this.

Parsing files

If you want to extract information from a parse tree, you first need to parse the files you are interested in. This is code for it:

  this.compiler = ToolProvider.getSystemJavaCompiler(); // gets the implemented compiler via a sort of service principle

  this.listJavaFileObjects = javaFiles.stream().map(x->x.getJavaFileObject()).toList(); // code specific for my project

  // create a task. The last argument is the list of files to parse
  this.task = (JavacTask) compiler.getTask(
        null,
        null,
        null,
        List.of("-proc:none"), // disable annotation processing
        null,
        listJavaFileObjects
  );

  // this.cuTrees is an iterable of CompilationUnitTree objects
  // task.parse creates AST's, analyze does subsequent steps, generating Symbol and Type objects
  try {
      this.cuTrees = task.parse();
      task.analyze();
  } catch (IOException e) {
      throw new RuntimeException(e);
  }

  // This trees object is necessary if you want to work with Symbols and Types
  this.trees = Trees.instance(task);

If you only want to work with the parsed tree, and are not interested in Symbol and Type objects, the result that you need from the code above is only the iterable of CompilationUnitTree (Iterable<? extends CompilationUnitTree>). You can omit task.analyze and this.trees = Trees.instance(task);. We will discuss them in upcoming blog posts.

Extending TreeScanner

TreeScanner is the class that orchestrates the traversal of the AST’s. If you create an instance of it and call scan on it with the CompilationUnitTree object of your file as first argument, all Tree nodes will be visited. As the visit methods do not do anything with the Trees, it does not give you any result. If you do want a result, ie extraact information from the AST, you need to create a subclass of TreeScanner and overwrite its methods. The candidate methods to overwrite are scan, visit and reduce.

This is how an override looks like:

public class MyCustomScanner extends TreeScanner<Void,Void> {
    
    // insert overrides here
}

Overriding a visit method

TreeScanner contains many visit methods and you can override one or more of them to insert custom code. This is a very basic example:

    @Override
    public Void visitClass(ClassTree node, Void p) {
        System.out.println("Class: " + node.getSimpleName());
        return super.visitClass(node, p);
    }

The interesting thing is the return statement. In the previous blog post we saw that traversal of the tree required that the visit methods contained ‘get’ methods to provide new child trees to scan. If you simply override a visit method without keeping this get methods intact, the traversal will not get any deeper into the child nodes which is often not what you want. The return statement calls the implementation of the parent method in TreeScanner, and as that parent method has all the get methods, traversal will proceed as normal. Basically the only thing that is added to the traversal process is the line System.out.println("Class: " + node.getSimpleName());.

Void

A remarkable element in this overridden method is the return type (Void) and the second argument, Void p. Void indicates that the method will return null, which is not the same as returning nothing at all (void in lowercase). This use of Void is a design choice, and aligns with the fact that bottom leafs (Tree objects without children) have a visit method that returns null, like these ones:

    public R visitIdentifier(IdentifierTree node, P p) {
        return null;
    }

    public R visitEmptyStatement(EmptyStatementTree node, P p) {
        return null;
    }

Overriding the scan method

Below is an overridden scan method. Like the overridden visit method, it has a return statement with ‘super.scan’ in it, which is just meant to not interrupt the traversal process. The getKind method is one of two methods found in interface Tree, the other being accept. These are the methods applicable to any Tree, while child interfaces have extra methods fitting their specific role and characteristics.

    @Override
    public Void scan(Tree tree, Void p) {
        if (tree != null) {
            System.out.println(tree.getKind() + ":" + tree.getClass().toString());
        }
        return super.scan(tree, p);
    }

Overriding the reduce method

The reduce method can as well be overridden. The default is this, it does not reduce anything as r2 is not used:

    public R reduce(R r1, R r2) {
        return r1;
    }

To understand what the role is of the reduce method, let’s look at the scanAndReduce method for single Tree objects:

    private R scanAndReduce(Iterable<? extends Tree> nodes, P p, R r) {
        return reduce(scan(nodes, p), r);
    }

The second value is the aggregated value of previous scans, while the first is the value returned by the current scan.

If you override, you must account for the fact that any of the two arguments can be null. ChatGPT advised me to start any override of reduce like this:

    if (r1 == null) return r2;
    if (r2 == null) return r1;

This implicates that the return value of reduce is only null if both r1 and r2 are null. It also means that no matter how many of the scan and scanAndReduce methods return null, it is still possible to get a non-null aggragate result.

To get a non-null result, it is required to tell Java what the return type must be, ie provide a specific type for the generic type parameter R. Furthermore it is required to provide code in the overridden reduce method that combines r1 and r2. Lastly, at least one of the visit methods must return a non-null value and have a signature with a type equivalent to the return type of the reduce method.

Example

The code below shows an example of an ScanC class that extends TreeScanner and manages to collect an aggregate result. This aggregate result has the form of a list of Strings, whereby the strings are the names of the methods in the CompilationUnitTree that is being traversed.

public class ScannerC extends TreeScanner<List<String>,Void> {

    @Override
    public List<String> reduce(List<String> r1, List<String> r2){

        if (r1 == null) return r2;
        if (r2 == null) return r1;

        r1.addAll(r2);
        return r1;
    }

    @Override
    public List<String> visitMethod(MethodTree node, Void p){

        super.visitMethod(node, p);  // to keep traversal going
        return new ArrayList<>(Arrays.asList(node.getName().toString()));
    }
}

Be aware that List<String> must be the first type parameter, as the first type parameter is R. Note that this is the signature of TreeScanner:

public class TreeScanner<R,P> implements TreeVisitor<R,P> {

List<String> must as well be chosen as type for the return value and the two arguments of reduce, and as return method of the visit method that you overwrite.

Collecting data: R or side-effect

There are two ways to retrieve data from a parse tree. You can collect it with the generic R parameter, using reduce similarly to the example given above. An easier way is to not override the reduce method, keep the return values of scan and/or visit Void, and add lines in either the scan or visit method(s) that mutate an instance variable in your custom subclass that extends TreeScanner.

Below is an example where information retrieved by the scan method is added to an instance field of type List<Tree.Kind>:

public class ScannerA extends TreeScanner<Void,Void> {

    List<Tree.Kind> collectedTreeKinds = new ArrayList<>();

    @Override
    public Void scan(Tree tree, Void p) {
        if (tree != null) {

            collectedTreeKinds.add(tree.getKind());
        }
        return super.scan(tree, p); // walk children
    }
}

# result for a single tiny .java file:
[COMPILATION_UNIT, PACKAGE, MEMBER_SELECT, MEMBER_SELECT, IDENTIFIER, INTERFACE, MODIFIERS, METHOD, MODIFIERS, PRIMITIVE_TYPE]

The collection of values is not done via R but as a side effect of the visit method. The code that starts the process looks like this:

    ScannerA scannerA = new ScannerA();
    scannerA.scan(cu, null);
    System.out.println(scannerA.collectedTreeKinds);

Another possibility is to overwrite one or more visit methods:

public class ScannerC extends TreeScanner<Void,Void> {

    List<String> collectedValues = new ArrayList<>();

    @Override
    public Void visitMethod(MethodTree node, Void p){

        String returnType = "void";

        String methodName = node.getName().toString();
        if (node.getReturnType() != null){
            returnType=node.getReturnType().toString();
        };
        collectedValues.add(methodName + ": returns " + returnType);

        return super.visitMethod(node, p);
    }
}

# result for small class:
[<init>: returns void, getSomeString: returns String, greet: returns void]

ChatGPT told me that collecting values using side effect is more common than working with R, and I understand why. You do not have to bother about the reduce method, about setting the generic types correctly and you can collect as many items of as many types as you want.

The generic P parameter

Not only is there a generic R that can be used to collect aggragate values, there is also a generic P that can be used to provide context. This P is found all around, except for the reduce method:

# The TreeScanner class declaration
public class TreeScanner<R,P> implements TreeVisitor<R,P> 

# scan method
    public R scan(Tree tree, P p) {
        return (tree == null) ? null : tree.accept(this, p);
    }

# scanAndReduce
    private R scanAndReduce(Tree node, P p, R r) {
        return reduce(scan(node, p), r);
    }

# all visit methods
    @Override
    public R visitCompilationUnit(CompilationUnitTree node, P p) {
        R r = scan(node.getPackage(), p);
        r = scanAndReduce(node.getImports(), p, r);
        r = scanAndReduce(node.getTypeDecls(), p, r);
        r = scanAndReduce(node.getModule(), p, r);
        return r;
    }

Use of P

When retrieving information from the AST, you might want to differentiate the interpretation of information based on some contextual factor. For example, you want to count the number of all method invocations, excluding those that are part of lambda expressions. Or you want to have a list of field names that are declared but not initialized.

Below I have created an example, not great but it illustrates a bit. Actually I do not have good ideas about how and where I would use this P parameter but at least this works as intended:

enum Context { NORMAL, SPECIAL }

public class ScannerD extends TreeScanner<Void,Context>{

    @Override
    public Void visitCompilationUnit(CompilationUnitTree node, Context context) {

        context =  Context.NORMAL;
        return super.visitCompilationUnit(node, context);
    }

    @Override
    public Void visitBlock(BlockTree node, Context context) {

        System.out.println("Entering a block");
        return super.visitBlock(node, Context.SPECIAL);
    }

    public Void visitIdentifier(IdentifierTree node, Context ctx){

        System.out.println("  " + node.getName().toString() + ": " + ctx);
        return null;  // this is the default for bottom leaf
    }
}

What this code does is that identifiers in blocks (method bodies, initializers) are treated differently from identifiers not in blocks. This is the output for a simple class:

  com: NORMAL
  java: NORMAL
  java: NORMAL
  com: NORMAL
  Greeting: NORMAL
  String: NORMAL
  Integer: NORMAL
  KustomClass: NORMAL
  String: NORMAL
  Integer: NORMAL
  KustomClass: NORMAL
Entering a block
  super: SPECIAL
  this: SPECIAL
  argument1: SPECIAL
  this: SPECIAL
  argument2: SPECIAL
  this: SPECIAL
  myVar: SPECIAL
  String: NORMAL
Entering a block
  String: SPECIAL
  String: SPECIAL
  String: SPECIAL
  justSomeString: SPECIAL
  justAnotherString: SPECIAL
  someString: SPECIAL
  String: NORMAL
Entering a block
  List: SPECIAL
  Integer: SPECIAL
  ArrayList: SPECIAL
  Integer: SPECIAL
  myList: SPECIAL
  x: SPECIAL
  System: SPECIAL
  greeting: SPECIAL

I have not figured out how to insert the line ‘Exiting a block’, I might need to work with R for that. Anyhow, it gives a bit of an idea of what to use P for.

Next topics will be Symbols and Types.

Visitor and Abstract Syntax Tree

Forbidden packages