跳到主要内容

2 篇博文 含有标签「Compiler Design」

the techniches about designing a comiler

查看所有标签

Ausar is Comming

· 阅读需 3 分钟
Ausar Wong
Free Developer

Introduction

Ausar is a light-weight script language written in pure Go with a dynamic type system. Its grammars are a deliberately simplified subset of the standard Go Spec. The original intention for this repo is to implement an expression engine as the core lib for the business rule engine and formula engine.

Why I Implement this Repo?

Close Relationship with My Working Scenario

I must emphasize that this repo is not a reimplementation of so many existed expresion engine libraries. I am working in a securities company with the high concern with the NUMERICAL ACCURACY SECURITY. For instance, the execution result of expr 0.1+0.2 must be 0.3 for my working scenarios. However, the mainstream of the existed open-source libararies fail to achieve this with taking the float number as its underlying number type. For float number, the execution result is ridiculous if the arithmetic operations are repeated with tremendous times.

Every Number should Be Decimal

To solve the issue as demonstrated above, Ausar takes the Decimal as the underlying type for numbers. Every inputted number would be transformed into the decimal type. Meanwhile, with the help of the famous Go Decimal library, every arithmetic operations of numbers would be converted into the corresponding decimal methods. For instance, the expr 0.1+0.2*0.3 would be eventually transformed into Decimal(0.1).Add(Decimal(0.2).Mul(Decimal(0.3))) during the execution phase.

From an Expression Engine to a Simplified Language

The original purpose of the repo is to implement a simple expression executor for the Business rule engine. As same as the equivalent Rust version: expression_engine_rs which I used to implement, the original version only supports some basic expressions like unary, binary, ternary expression and so on. The only complicated case is the chain expression with expressions separated by the punctuation ";" indicating that the expressions can be executed sequenctially.

Things became complicated when the formula engine scenario arised. Let me show a brief example as below:

func max(a, b) {
if a > b {
return a;
};
return b;
}

wA := 0.5;
wB := 0.4;

hA := 0.6;
hB := 0.9;

// size is the multiplication result of the maximum one of wA and wB with the maximum one of hA and hB.
size := max(wA,wB)*max(hA,hB);

This secnario requires more sophisticated features like function declarations and basic control flow. Eventually, the intention upgraded to the implementation of a light-weight script language.

Enthusiasm to The Compiler Design

I love compiler design with the genuine passion to decode the string into the corresponding abstact syntax tree. I used to spend 7 to 8 months in the implementation of some well-known automated parsing algorithms, like LL, LR, Early, and PEG. Thankfully, I eventually achieved these goals. I had the aspiration to apply the knowledge into some realistic scenarios. This repo combines the techniques of handwritten parsers in my early work and automated parsers (mainly PEG) of my recent work.

Compilers: Automated vs Handwritten

· 阅读需 11 分钟
Ausar Wong
Free Developer

Introduction

As we all know, there are many parser-generated techniques like Yacc, Bison and ANTLR. They are either an LL Parser or an LR Parser. Such parsers are automated. A compiler designer can easily "implement" a sophisticated parser with such automation techniques on the basis of a predefined CFG or PEG grammar. These delicate techniques facilitate the process of designing and implementation of a novel programming language. However, the majority of modern languages like Go and Rust abandoned such automation techniques and switched to a handwritten candidate. Why technologies that were prosperous decades ago are not popular today?

Short Answers

I think there are three major reasons for the revival of handwritten parsers:

  • Efficiency
  • Error Tracing
  • Code Optimization

Experiment

I make an experiment here to better illustrate the three reasons listed above. The experiment is simple, deserializing a valid JSON string. Specifically, the JSON deserialization task is composed of two subtasks: Decoding data from a JSON string Bind the decoded data to a structured representation

The former subtask is generic, and all the JSON parsers need to accomplish it. While it is unnecessary to bind the decoded JSON data to tremendous types of data structures. For experiment purpose, it is enough to bind the data to a basic JSON representation. JSON Representation Definition According to the JSON specification, JSON is a data format with 6 fundamental data types, respectively, Null, Number, Boolean, String, Array, and Object. As follows, the type "JsonValue" represents a basic JSON data with six pointer fields which corresponds to the six fundamental types of the JSON format. For instance, if the instanced JsonValue indicates a "Null" JSON data, the field "Null" is assigned with the five remaining fields are null pointers.

type JsonValue struct {
Null any `json:"null,omitempty"`
Number *float64 `json:"number,omitempty"`
Str *string `json:"str,omitempty"`
Array JsonArray `json:"array,omitempty"`
Object JsonObject `json:"object,omitempty"`
Bool *bool `json:"bool,omitempty"`
}

type JsonArray []JsonValue
type JsonObject map[string]JsonValue

Handwritten Parser

At first, I implemented a handwritten JSON parser, which is a Top-Down and Recursive descent parser. Thanks to the brief JSON Spec, the whole implementation is simple with just 170 lines of code, an entrance handler with 6 branch handlers.

const (
NumberRegStr = "-?(0|[1-9][0-9]*)(.[0-9]+)?([eE][-+]?(0|[1-9][0-9]*))?"
StringRegStr = `^"([^"\\\\]*|\\\\["\\\\bfnrt\/]|\\\\u[0-9a-f]{4})*"`
)

var numberReg, _ = regexp.Compile(NumberRegStr)

var stringReg, _ = regexp.Compile(StringRegStr)

type JsonDeserializerInterface interface {
Deserialize(input string) (*JsonValue, bool)
}

type JsonDeserializer struct{}

func (j JsonDeserializer) Deserialize(input string) (*JsonValue, bool) {
value, _, ok := j.deserialize(input)
return value, ok
}

type DeserializerHandlerInterface = func(input string) (value *JsonValue, remaining string, ok bool)

func (j JsonDeserializer) deserialize(input string) (value *JsonValue, remaining string, ok bool) {
for _, handler := range []DeserializerHandlerInterface{
j.deserializeNull,
j.deserializeNumber,
j.deserializeBool,
j.deserializeString,
j.deserializeArray,
j.deserializeObject,
} {
if value, remaining, ok = handler(input); ok {
return
}
}
return nil, input, false
}

func (j JsonDeserializer) deserializeNull(input string) (*JsonValue, string, bool) {
remaining := strings.TrimSpace(input)
if strings.HasPrefix(remaining, "null") {
return &JsonValue{
Null: true,
}, remaining[4:], true
}
return nil, input, false
}

func (j JsonDeserializer) deserializeNumber(input string) (*JsonValue, string, bool) {
remaining := strings.TrimSpace(input)
loc := numberReg.FindStringIndex(remaining)
if len(loc) == 0 || loc[0] != 0 {
return nil, input, false
}
numberStr := remaining[loc[0]:loc[1]]
number, err := strconv.ParseFloat(numberStr, 64)
if err != nil {
return nil, input, false
}
return &JsonValue{
Number: &number,
}, remaining[loc[1]:], true
}

func (j JsonDeserializer) deserializeBool(input string) (*JsonValue, string, bool) {
remaining := strings.TrimSpace(input)
var value bool
if strings.HasPrefix(remaining, "true") {
value = true
return &JsonValue{
Bool: &value,
}, remaining[4:], true
} else if strings.HasPrefix(remaining, "false") {
return &JsonValue{
Bool: &value,
}, remaining[5:], true
}
return nil, input, false
}

func (j JsonDeserializer) deserializeString(input string) (*JsonValue, string, bool) {
remaining := strings.TrimSpace(input)
loc := stringReg.FindStringIndex(remaining)
if len(loc) == 0 || loc[0] != 0 {
return nil, input, false
}
s := remaining[1 : loc[1]-1]
return &JsonValue{
Str: &s,
}, remaining[loc[1]:], true
}

func (j JsonDeserializer) deserializeArray(input string) (*JsonValue, string, bool) {
var remaining string
var ok bool
if remaining, ok = j.expect(input, "["); !ok {
return nil, input, false
}
var ans []JsonValue
for {
var value *JsonValue
value, remaining, ok = j.deserialize(remaining)
if !ok {
break
}
ans = append(ans, *value)
remaining, ok = j.expect(remaining, ",")
if !ok {
break
}
}
if remaining, ok = j.expect(remaining, "]"); !ok {
return nil, input, false
}
return &JsonValue{
Array: ans,
}, remaining, true
}

func (j JsonDeserializer) deserializeObject(input string) (*JsonValue, string, bool) {
input = strings.TrimSpace(input)
var remaining string
var ok bool
if remaining, ok = j.expect(input, "{"); !ok {
return nil, input, false
}
m := make(JsonObject)
for {
var key *JsonValue
var value *JsonValue
key, remaining, ok = j.deserializeString(remaining)
if !ok {
break
}
remaining, ok = j.expect(remaining, ":")
if !ok {
return nil, input, false
}
value, remaining, ok = j.deserialize(remaining)
if !ok {
return nil, input, false
}
m[*key.Str] = *value
remaining, ok = j.expect(remaining, ",")
if !ok {
break
}
}
if remaining, ok = j.expect(remaining, "}"); !ok {
return nil, input, false
}
return &JsonValue{
Object: m,
}, remaining, true
}

func (j JsonDeserializer) expect(input string, expected string) (string, bool) {
input = strings.TrimSpace(input)
if strings.HasPrefix(input, expected) {
return input[len(expected):], true
}
return input, false
}

Auto-generated

Different with the handwritten parsers, their automated rivals need to define the formal grammar and corresponding semantic actions to accomplish the deserialization task. CFG A standard JSON CFG is as below:

Json       -> Value
Array -> LBracket ValueList RBracket | LBracket RBracket
Object -> LCurly PairList RCurly | LCurly RCurly
PairList -> Pair Comma PairList | Pair
ValueList -> Value Comma ValueList | Value
Pair -> String Colon Value
Value -> String | Number | Object | Array | True | False
LCurly -> "{"
RCurly -> "}"
LBracket -> "\["
RBracket -> "\]"
String -> "^"([^"\\\\]*|\\\\["\\\\bfnrt\/]|\\\\u[0-9a-f]{4})*" "
Number -> "-?(0|[1-9][0-9]*)(.[0-9]+)?([eE][-+]?(0|[1-9][0-9]*))?"
True -> "true"
False -> "false"
Null -> "null"
Comma -> ","
Colon -> ":"

LL

The LL Parser is a Top-Down and non-recursive parser. It is auto-generated and table-driven. Semantic Actions For every production in the grammar, we need define the corresponding semantic action.

var llJsonActions = []ll.SemanticAction[string, JsonValue]{
// 0: Array -> LBracket Array"
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[1].Variable.Value
return nil
},

// 1: Array" -> RBracket
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Array: []JsonValue{},
}
return nil
},

// 2: Array" -> ValueList RBracket
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 3: Json -> Value
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 4: Object -> LCurly Object"
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[1].Variable.Value
return nil
},

// 5: Object" -> PairList RCurly
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 6: Object" -> RCurly
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Object: map[string]JsonValue{},
}
return nil
},

// 7: Pair -> String Colon Value
func(v *tree.Variable[string, JsonValue]) error {
key := v.Children[0].Terminal.Matched
value := v.Children[2].Variable.Value
v.Value = JsonValue{
Object: map[string]JsonValue{
key: value,
},
}
return nil
},

// 8: PairList -> Pair PairList"
func(v *tree.Variable[string, JsonValue]) error {
a, b := v.Children[0].Variable.Value.Object, v.Children[1].Variable.Value.Object
for key, value := range b {
a[key] = value
}
v.Value = JsonValue{
Object: a,
}
return nil
},

// 9: PairList" -> #
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Object: map[string]JsonValue{},
}
return nil
},

// 10: PairList" -> Comma PairList
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[1].Variable.Value
return nil
},

// 11: Value -> Array
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 12: Value -> False
func(v *tree.Variable[string, JsonValue]) error {
value := false
v.Value = JsonValue{
Bool: &value,
}
return nil
},

// 13: Value -> Null
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Null: true,
}
return nil
},

// 14: Value -> Number
func(v *tree.Variable[string, JsonValue]) error {
ans, err := strconv.ParseFloat(v.Children[0].Terminal.Matched, 64)
if err != nil {
return err
}
v.Value = JsonValue{
Number: &ans,
}
return nil
},

// 15: Value -> Object
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 16: Value -> String
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Str: &v.Children[0].Terminal.Matched,
}
return nil
},

// 17: Value -> True
func(v *tree.Variable[string, JsonValue]) error {
value := true
v.Value = JsonValue{
Bool: &value,
}
return nil
},

// 18: ValueList -> Value ValueList"
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Array: append([]JsonValue{v.Children[0].Variable.Value}, v.Children[1].Variable.Value.Array...),
}
return nil
},

// 19: ValueList" -> #
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Array: []JsonValue{},
}
return nil
},

// 20: ValueList" -> Comma ValueList
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[1].Variable.Value
return nil
},
}

LR

The LR Parser is a Bottom-Up and non-recursive parser. As the same as the LL Parser, it is auto-generated and table-driven too. Semantic Actions For every production in the grammar, we need define the corresponding semantic action.

var lrJsonActions = []lr.SemanticAction[string, JsonValue]{
// 0: Json' -> Json
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 1: Json -> Value
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 2: Array -> LBracket ValueList RBracket
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[1].Variable.Value
return nil
},

// 3: Array -> LBracket RBracket
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Array: make(JsonArray, 0),
}
return nil
},

// 4: Object -> LCurly PairList RCurly
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[1].Variable.Value
return nil
},

// 5: Object -> LCurly RCurly
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Object: make(JsonObject),
}
return nil
},

// 6: PairList -> Pair Comma PairList
func(v *tree.Variable[string, JsonValue]) error {
a, b := v.Children[0].Variable.Value.Object, v.Children[2].Variable.Value.Object
for key, value := range b {
a[key] = value
}
v.Value = JsonValue{
Object: a,
}
return nil
},

// 7: PairList -> Pair
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 8: ValueList -> Value Comma ValueList
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Array: append([]JsonValue{v.Children[0].Variable.Value}, v.Children[2].Variable.Value.Array...),
}
return nil
},

// 9: ValueList -> Value
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Array: []JsonValue{v.Children[0].Variable.Value},
}
return nil
},

// 10: Pair -> String Colon Value
func(v *tree.Variable[string, JsonValue]) error {
key := v.Children[0].Terminal.Matched
value := v.Children[2].Variable.Value
v.Value = JsonValue{
Object: map[string]JsonValue{
key: value,
},
}
return nil
},

// 11: Value -> String
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Str: &v.Children[0].Terminal.Matched,
}
return nil
},

// 12: Value -> Number
func(v *tree.Variable[string, JsonValue]) error {
ans, err := strconv.ParseFloat(v.Children[0].Terminal.Matched, 64)
if err != nil {
return err
}
v.Value = JsonValue{
Number: &ans,
}
return nil
},

// 13: Value -> Object
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 14: Value -> Array
func(v *tree.Variable[string, JsonValue]) error {
v.Value = v.Children[0].Variable.Value
return nil
},

// 15: Value -> True
func(v *tree.Variable[string, JsonValue]) error {
value := true
v.Value = JsonValue{
Bool: &value,
}
return nil
},

// 16: Value -> False
func(v *tree.Variable[string, JsonValue]) error {
value := false
v.Value = JsonValue{
Bool: &value,
}
return nil
},

// 17: Value -> Null
func(v *tree.Variable[string, JsonValue]) error {
v.Value = JsonValue{
Null: true,
}
return nil
},
}

Comparison

Efficiency

The performance benchmark highly corresponds to the parsing capability. I conduct an experiment here to test the performance of different parsers. The test data is as below:

// Here is a brief JSON data
{
"haha":"haha",
"dd": 23,
"d": true,
"c": null,
"extended": {"haha":"haha", "dd": 23, "d": true, "c": null},
"list": [1 , 2, false]
}

image

Benchmark for Handwritten Parser

image

Benchmark for Auto-generated LL Parser

image

Benchmark for Auto-generated LR Parser

The above pictures show that the handwritten parser performs much better than the auto-generated ones with 19846 ns/op and 78 allocs/op which are 10 times faster and more space-efficient than the latter.

Error Tracing & Code Optimization

Error tracing and code optimization play a vital role in the implementation of a robust and stable parser. As illustrated above, the whole parsing process can be classified into two sub processes: a) decoding; b) binding. For handwritten parsers, the designers can debug and optimize every piece of code. However, it's impossible for auto-generated ones to accomplish such sufficient code debugging and optimization. The runtime of the latter ones is not accessible for the implementers, which means they can only manipulate the code in sub process b): binding. It's hard for you to optimize the runtime for a specific scenario, except when you implement the runtime by yourself. Ultimately, you switch to the handwritten parsers.

Conclusion

This post provides a brief but pretty detailed explanation of the popularity of the handwritten parsers, which is a result of the pursuit for better efficiency, easier error tracing and code optimization.