🍮 A Little Taste of Compiler Theory in Action, Dessert Course · How to Build Template String Nodes?

👓 Lexical Analysis and Syntax Analysis

I am writing this mainly to record a concept that took me a long time to understand, so I won't discuss these topics at length here. This section covers just enough background to help readers follow the rest of the article.

The lexical analyzer, usually called a Lexer or Scanner, breaks the source code into a list in which each item is an indivisible "lexeme", called a Token.

Let's take an example in C, with the following source code:

#include <stdio.h>

int main() {
  printf("Hello world!");
}

Here are all the Tokens in the source code above:

<Preprocessor directive #include> <Space> <Left angle bracket '<'> <Name stdio.h> <Right angle bracket '>'> <Newline>
<Newline>
<Keyword int> <Space> <Name main> <Left parenthesis '('> <Right parenthesis ')'> <Space> <Left curly brace '{'> <Newline>
<Space> <Space> <Name printf> <Left parenthesis '('> <String literal> <Right parenthesis ')'> <Semicolon ';'> <Newline>
<Right curly brace '}'>

The lexical analyzer takes a string as input and outputs a list of Tokens. The key attributes of each Token are its position, raw content, and token type. The position is recorded mainly so that errors in the code can be located easily later.
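To make the input and output concrete, here is a minimal sketch of a lexer in TypeScript. The `Token` shape, the `TokenType` names, and the `tokenize` function are assumptions made for illustration only; they cover just names, spaces, newlines, and a few punctuators, not the full C example above.

```typescript
// Hypothetical token shape; all names here are illustrative assumptions.
type TokenType = "Name" | "Space" | "Newline" | "Punctuator";

interface Token {
  type: TokenType; // token type
  raw: string;     // raw content, exactly as written in the source
  start: number;   // offset of the first character, kept for error reporting
}

const PUNCTUATORS = "(){};<>";

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < source.length) {
    const start = pos;
    const ch = source.charAt(pos);
    let type: TokenType;
    if (ch === " ") {
      type = "Space";
      pos++;
    } else if (ch === "\n") {
      type = "Newline";
      pos++;
    } else if (PUNCTUATORS.indexOf(ch) !== -1) {
      type = "Punctuator";
      pos++;
    } else if (/[A-Za-z_]/.test(ch)) {
      type = "Name";
      // A name keeps consuming letters, digits, and underscores.
      while (pos < source.length && /[A-Za-z0-9_]/.test(source.charAt(pos))) pos++;
    } else {
      // The recorded position is what makes this message useful.
      throw new SyntaxError("Unexpected character '" + ch + "' at offset " + pos);
    }
    tokens.push({ type, raw: source.slice(start, pos), start });
  }
  return tokens;
}
```

Because every character ends up in exactly one Token, joining the `raw` fields of the result gives back the original input.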

The syntax analyzer, usually called a Parser, combines Tokens that match the grammar of the programming language into syntax nodes, and finally assembles those nodes into an "Abstract Syntax Tree" (AST).
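As a rough illustration of "combining Tokens into syntax nodes", the sketch below parses a flat list of token strings such as `a + b + c` into a small tree. The `Expr` type and the `parseSum` function are hypothetical and handle only names joined by `+`; a real Parser works with full Token objects and a much larger grammar.

```typescript
// Hypothetical AST shape for sums of names; not taken from any real parser.
type Expr =
  | { kind: "Identifier"; name: string }
  | { kind: "Binary"; operator: "+"; left: Expr; right: Expr };

function parseSum(tokens: string[]): Expr {
  let pos = 0;

  // Consume one name token and wrap it in an Identifier node.
  function parseIdentifier(): Expr {
    const t = tokens[pos];
    if (t === undefined || !/^[A-Za-z_]\w*$/.test(t)) {
      throw new SyntaxError("Expected a name at token " + pos);
    }
    pos++;
    return { kind: "Identifier", name: t };
  }

  // Each "+" combines the tree built so far with the next name.
  let left = parseIdentifier();
  while (tokens[pos] === "+") {
    pos++;
    const right = parseIdentifier();
    left = { kind: "Binary", operator: "+", left, right };
  }
  if (pos < tokens.length) {
    throw new SyntaxError("Unexpected token at " + pos);
  }
  return left;
}
```

Calling `parseSum(["a", "+", "b", "+", "c"])` yields a `Binary` node whose left child is itself a `Binary` node, which is the tree structure the AST name refers to.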

🥵 Why Template Strings Are Not Easy to Handle

Below, I will use JavaScript's template string syntax as an example.

When I started implementing the Lexer, I assumed that reading the input character by character would always yield lexemes in a one-to-one correspondence. Template strings turned out to be different.

A template string, which may even contain nested template strings, is hard to read as a single "indivisible" lexeme during the lexical analysis stage, because it clearly can be divided.

`my name is ${"David" + ` - ${firstName}`}, nice to meet you!`

🤔️ What Should We Do?

To keep reading Tokens in sequence, we should keep a few variables that represent the current parsing state, and not handle template strings the way ordinary strings are handled.

What is certain is that template strings must be processed in the syntax analysis stage, and that the result is a syntax node containing the following two kinds of content:

Possibly multiple scattered string texts
Interpolation expressions, which are also syntax nodes

interface TemplateStringNode {
  quasis: TemplateElement[]
  expressions: ExpressionNode[]
}
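As a worked example, the sketch below fills in the node for a simple, non-nested template string and prints it back out. The `TemplateElement` and `ExpressionNode` definitions are simplified assumptions, since their fields are not specified above, and `printTemplate` is a hypothetical helper.

```typescript
// Hypothetical, simplified shapes for the two referenced types.
interface TemplateElement {
  raw: string;   // one piece of string text
  tail: boolean; // true for the last piece
}

type ExpressionNode =
  | { kind: "Identifier"; name: string }
  | { kind: "StringLiteral"; value: string };

interface TemplateStringNode {
  quasis: TemplateElement[];
  expressions: ExpressionNode[];
}

// Node for: `my name is ${firstName}, nice to meet you!`
const node: TemplateStringNode = {
  quasis: [
    { raw: "my name is ", tail: false },
    { raw: ", nice to meet you!", tail: true },
  ],
  expressions: [{ kind: "Identifier", name: "firstName" }],
};

function printExpression(e: ExpressionNode): string {
  return e.kind === "Identifier" ? e.name : JSON.stringify(e.value);
}

// Rebuild the source text by interleaving text pieces and expressions.
function printTemplate(t: TemplateStringNode): string {
  let out = "`";
  t.quasis.forEach((q, i) => {
    out += q.raw;
    const e = t.expressions[i];
    if (e !== undefined) {
      out += "${" + printExpression(e) + "}";
    }
  });
  return out + "`";
}
```

In this layout the text pieces and the expressions alternate, so there is always one more text piece than there are expressions, even if some pieces are empty strings.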

So the best approach is to pick a distinctive Token that tells the Parser to start parsing a template string when it is encountered, and to handle possible nesting based on the state information.

We keep the following two key pieces of state:

isReadingText, whether text is currently being read
nested, the nesting level

Based on the deduction and analysis in the above diagram, we can draw the following conclusions:

The template string quotation mark (the backtick) toggles isReadingText.
The interpolation expression start symbol ${ increases the nested level by one and sets isReadingText to false, because an interpolation expression is about to be read.
The right curly brace } decreases the nested level by one and sets isReadingText back to true, because reading returns to the text of the template string.

Parsing of a template string node is complete when a template string quotation mark sets isReadingText to false while the nested level is 0.
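The three rules and the stop condition can be sketched as a small state machine. The `findTemplateEnd` function below is a hypothetical helper that scans characters directly rather than Tokens, to keep the example short. It is a simplification under stated assumptions: it ignores escape sequences, and it treats every } inside an interpolation as the end of that interpolation, so object literals or backticks inside ordinary strings within an expression would confuse it.

```typescript
// Returns the index just past the closing backtick of the template string
// that starts at `start`, or -1 if the template string is never closed.
function findTemplateEnd(source: string, start: number): number {
  let isReadingText = false; // whether template text is currently being read
  let nested = 0;            // how many interpolations are currently open

  for (let i = start; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "`") {
      // Rule 1: a backtick toggles isReadingText.
      isReadingText = !isReadingText;
      // Stop condition: text reading just ended and nothing is still open.
      if (!isReadingText && nested === 0) return i + 1;
    } else if (isReadingText && ch === "$" && source.charAt(i + 1) === "{") {
      // Rule 2: "${" opens an interpolation (only meaningful inside text).
      nested++;
      isReadingText = false;
      i++; // skip the "{"
    } else if (!isReadingText && ch === "}") {
      // Rule 3: "}" closes the interpolation and returns to the text.
      nested--;
      isReadingText = true;
    }
  }
  return -1;
}
```

Run on the nested example from earlier, the inner closing backtick sets isReadingText to false while nested is still 1, so scanning continues; only the final backtick meets both conditions.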

References

Two Stack Overflow questions and answers: