The Heart of a Language Server

In this post, I want to expand on one curious comment from the rust-analyzer code base. You can find the comment here.

It describes a curious recursive algorithm that is repeated across different language-server-shaped things: I’ve seen it implemented for Kotlin and C#, and implemented it myself for Rust.

Here’s a seemingly random grab bag of IDE features:

  • Go to definition

  • Code completion

  • Run test at the cursor

  • Extract variable

What’s common among them all? All these features are relative to the current position of the cursor! The input is not only the state of the code at a given point in time, but a specific location in the source of a project, like src/main.rs:90:2.

And the first thing a language server needs to do for any of the above features is to understand what is located at the given offset, semantically speaking. Is it an operator, like +? Is it a name, like foo? If it is a name, in what context is the name used --- does it define an entity named foo, or does it refer to a pre-existing entity? If it is a reference, then what entity is referenced? What type is it?

The first step here is determining the node in the syntax tree that covers the offset. This is relatively straightforward:

fn node_at_offset(node: SyntaxNode, offset: u32) -> SyntaxNode {
    assert!(node.text_range().contains(offset));
    // Descend into the child that covers the offset; if no child covers it,
    // the current node is the deepest one that does.
    node.children()
        .find(|it| it.text_range().contains(offset))
        .map(|it| node_at_offset(it, offset))
        .unwrap_or(node)
}

But the syntax tree by itself doesn’t contain enough information to drive IDE features. Semantic analysis is required.

But the problem with semantic analysis is that it usually involves several layers of intermediate representations, which are only indirectly related to the syntax tree. While the syntax tree is relatively uniform, and it is possible to implement a generic traversal like the one above, semantic information is usually stored in a menagerie of ad-hoc data structures: trees, graphs, and plain old hash tables.

Traditional compilers attach source span information to semantic elements, which could look like this:

struct Span {
    file: PathBuf,
    line: u32,
    column: u32,
}

struct LocalVariable {
    name: InternedString,
    mutability: Mutability,
    ty: Type,
    span: Span
}

With span information in place, it is possible for a language server to find the appropriate semantic element for a given cursor position: just iterate over all the semantic elements there are, and find the one with the smallest span that still contains the cursor.
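
In code, this lookup might look like the following sketch. The SpanRange and HasSpan types are hypothetical, and, unlike the Span above, the span here carries an end offset, because "smallest containing span" needs an extent, not just a starting point:

use std::path::{Path, PathBuf};

// Hypothetical span covering a byte range within a file.
struct SpanRange {
    file: PathBuf,
    start: u32,
    end: u32,
}

impl SpanRange {
    fn contains(&self, file: &Path, offset: u32) -> bool {
        self.file.as_path() == file && self.start <= offset && offset < self.end
    }

    fn len(&self) -> u32 {
        self.end - self.start
    }
}

// Hypothetical trait: any semantic element that knows where it came from.
trait HasSpan {
    fn span(&self) -> &SpanRange;
}

fn element_at<T: HasSpan>(elements: &[T], file: &Path, offset: u32) -> Option<&T> {
    elements
        .iter()
        .filter(|it| it.span().contains(file, offset))
        // Of all the elements covering the offset, the narrowest one wins.
        .min_by_key(|it| it.span().len())
}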

This approach works, but has two drawbacks.

The first drawback is performance. To iterate over all semantic elements, an entire compilation unit must be analyzed, and that is too slow even when done incrementally. The core trick of a performant language server is that it avoids any analysis unless absolutely necessary: the server knows everything about the function currently on the screen, and almost nothing about other functions.

The second drawback is more philosophical --- using text spans erases information about the underlying syntax trees. A LocalVariable didn’t originate from a particular span of text; it was created from a specific node in the concrete syntax tree. For features like "go to definition", which need to go from syntax to semantics, the approximation turns out to be good enough. But for refactors, it is often convenient to go in the opposite direction --- from semantics to syntax. To change a tuple enum to a record enum, a language server needs to find all usages of the enum in the semantic model, but then it needs to modify the syntax tree. And going from a Span back to the SyntaxNode is not straightforward: different syntax nodes might have the same span!

For example, the foo in the following tree is simultaneously:

  • a name token

  • a reference

  • a trivial path (a single segment, as opposed to foo::bar)

  • a path expression

PATH_EXPR@20..23
  PATH@20..23
    PATH_SEGMENT@20..23
      NAME_REF@20..23
        IDENT@20..23 "foo"

Iterative Recursive Analysis

So, how can a language server map syntax nodes to corresponding semantic elements, so that the mapping is precise and can be computed lazily?

First, every semantic element gets a source_syntax method that returns the original syntax node:

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode
}

The method is implemented differently for different types. Sometimes, storing a reference to a syntax node is appropriate:

struct LocalVariable {
    source_syntax: SyntaxNodeId,
}

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode {
        node_id_to_node(self.source_syntax)
    }
}

Alternatively, the syntax might be computed on demand. For example, for local variables we might store a reference to the parent function, and the ordinal number of this local variable:

struct LocalVariable {
    parent: Function,
    ordinal: usize,
}

impl LocalVariable {
    pub fn source_syntax(&self) -> SyntaxNode {
        let parent_function_syntax = self.parent.source_syntax();
        parent_function_syntax
            .descendants()
            // The n-th local variable in the semantic model corresponds to
            // the n-th local variable node in the function's syntax tree.
            .filter(|it| it.kind() == SyntaxNodeKind::LocalVariable)
            .nth(self.ordinal)
            .unwrap()
    }
}

Yet another pattern is to get this information from a side table:

type SyntaxMapping = HashMap<LocalVariable, SyntaxNode>;
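
With such a table, the lookup itself is a plain hash-map access; a minimal sketch, assuming the mapping is passed in by the caller:

impl LocalVariable {
    // The side table is supplied from the outside; the semantic element
    // itself stores no link back to the syntax tree.
    pub fn source_syntax(&self, mapping: &SyntaxMapping) -> SyntaxNode {
        mapping[self].clone()
    }
}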

In rust-analyzer all three approaches are used in various places.

This solves the problem of going from a semantic element to syntax, but what we started with was the opposite: from an offset like main.rs:80:20 we get to a SyntaxNode, and then we need to discover the corresponding semantic element. The trick is to use the same solution in both directions:

To find a semantic element for a given piece of syntax:

  1. Look at the parent syntax node.

  2. If there is no parent, then the current syntax node corresponds to an entire file, and the appropriate semantic element is the module.

  3. Otherwise, recursively look up semantics for the parent.

  4. Among all the parent’s children (our siblings), find the one whose source syntax is the node we started with.

Or, in pseudocode:

fn semantics_for_syntax(node: SyntaxNode) -> SemanticElement {
    match node.parent() {
        // No parent: the node is the whole file, and the matching
        // semantic element is the file's module.
        None => module_for_file(node.source_file),
        Some(parent) => {

            // Recursive call
            let parent_semantics = semantics_for_syntax(parent);

            for sibling in parent_semantics.children() {
                if sibling.source_syntax() == node {
                    return sibling;
                }
            }

            unreachable!("every syntax node has a semantic counterpart")
        }
    }
}

In this formulation, a language server needs to do just enough analysis to drill down to a specific node.
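
Putting the pieces together, the position-based entry point for an IDE feature is essentially the composition of node_at_offset from the beginning of the post and semantics_for_syntax:

fn semantic_element_at(file_syntax: SyntaxNode, offset: u32) -> SemanticElement {
    // Purely syntactic: drill down to the deepest node covering the offset.
    let node = node_at_offset(file_syntax, offset);
    // Semantic, but lazy: only the ancestors of `node` are analyzed.
    semantics_for_syntax(node)
}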

Consider this example:

struct RangeIter {
    lo: u32,
    hi: u32,
}

impl Iterator for RangeIter {
    type Item = u32;

    fn next(&mut self) -> Option<Item> {
                                 //  ^ Cursor here

    }
}

impl RangeIter {
    ...
}

Starting from the Item syntax node, the language server will consider:

  • the return type of the function next,

  • the function itself,

  • the impl Iterator block,

  • the entire file.

Just enough semantic analysis will be executed to learn that the file has a struct declaration and two impl blocks, but the contents of the struct and of the second impl block won’t be inspected at all. That is a huge win --- typically, source files are much wider than they are deep.

This recursion-and-loop structure is present in many language servers. For rust-analyzer, see the source_to_def module, with many functions that convert syntax (ast:: types) to semantics (unqualified types).

fn type_alias_to_def(
    &mut self,
    src: InFile<ast::TypeAlias>,
) -> Option<TypeAliasId> {

For Roslyn, one entry point to the machinery is the GetDeclaredType function. BaseTypeDeclarationSyntax is, well, syntax, while the return type NamedTypeSymbol is the semantic info. First, Roslyn looks up the semantic info for the syntactic parent, using GetDeclaredTypeMemberContainer. Then, in GetDeclaredMember, it iterates the semantic siblings and finds the one with the matching text range.

For Kotlin, the entry point is findSourceFirDeclarationByExpression. This function starts with a syntax node (KtExpression is syntax, like all Kt nodes) and returns a declaration. It uses getNonLocalContainingOrThisDeclaration to get the syntactic container for the current node. Then, findSourceNonLocalFirDeclaration gets the Fir for this parent. Finally, the findElementIn function traverses the Fir children to find the one with the same source we originally started with.

Limitations

There are two properties of the underlying languages which make this approach work:

  1. Syntactic nesting must match semantic nesting. Looking among the parent’s children makes sense only if the current element is expected to be there.

  2. Getting the semantic element for an entire file is trivial.

The second one is actually less true in Rust than it is in Kotlin or C#! In those languages, each file starts with a package declaration, which immediately mounts the file at the appropriate place in the semantic model.

For Rust, a file foo.rs only exists semantically if some parent file includes it via a mod foo; declaration! And, in general, it’s impossible to locate the parent file automatically. Usually, for src/bar/foo.rs the parent would be src/bar.rs, but, due to #[path] attributes which override this default, this might not be true. So rust-analyzer has to be less lazy than ideal here --- on every change, it reconstructs the entire module tree for a crate, looking at every file, even if only a single file is currently visible.
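
As an illustration (the file layout here is made up), the default mapping and a #[path] override look like this:

// src/bar.rs

// Default: the child module is expected at src/bar/foo.rs.
mod foo;

// Override: the child module lives wherever the attribute points, so nothing
// in the child file itself says which file is its parent.
#[path = "some/other/file.rs"]
mod baz;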

Here’s another interesting example:

mod ast {
    generate_ast_from_grammar!("FooLang.grm");
}

Here, we have a hypothetical procedural macro which reads a grammar definition from an external file and presumably generates a bunch of Rust types for the AST described by the grammar. One could dream of an IDE which, without knowing anything specific about the .grm format, can still find usages of AST nodes defined therein, using the span information from the procedural macro. This works in theory: when the macro creates Rust token trees, it can manufacture spans that point inside FooLang.grm, which connects the Rust source with the grammar.

Where this breaks down is laziness. When a user invokes "find usages" inside FooLang.grm, the IDE has no way of knowing, up-front, that the generate_ast_from_grammar!("FooLang.grm") macro call needs to be expanded. The only way this could work is if the IDE conservatively expands all macros all the time.