Usage

Intro

ParSitter.jl supports matching and querying any tree that supports the AbstractTrees.jl interface. Throughout the documentation, two types of trees will be mentioned:

  • query tree: the tree used to extract values from the target tree.
  • target tree: the tree being queried.

Note

The difference between matching and querying is that matching attempts to match trees starting from the root and progressing recursively towards the leaves, while querying matches a query tree against every sub-tree of a target tree.
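
The distinction can be illustrated with a small self-contained sketch. It does not use ParSitter at all: trees are plain nested tuples, and `toy_match` and `toy_query` are hypothetical names introduced only for this illustration. The rule that extra target children are ignored is an assumption chosen to mirror the examples further down.

```julia
# Toy trees as nested tuples: the first element is the node value, the rest are children.
node_value(t) = t isa Tuple ? first(t) : t
node_children(t) = t isa Tuple ? Base.tail(t) : ()

# Matching: compare from the root downwards; every query child must match the
# target child at the same position (extra target children are ignored here).
function toy_match(target, query)
    node_value(target) == node_value(query) || return false
    tc, qc = node_children(target), node_children(query)
    length(qc) <= length(tc) || return false
    return all(toy_match(t, q) for (t, q) in zip(tc, qc))
end

# Querying: try to match the query against the target and each of its sub-trees,
# producing one result per node of the target.
function toy_query(target, query)
    results = Any[(toy_match(target, query), target)]
    for child in node_children(target)
        append!(results, toy_query(child, query))
    end
    return results
end

target = (1, 2, (3, 4), (1, 2))
query = (1, 2)

root_matches = toy_match(target, query)
matching_subtrees = [sub for (ok, sub) in toy_query(target, query) if ok]
```

Here matching looks only at the root, whereas querying also finds the nested `(1, 2)` sub-tree.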

Support functions for matching

Because matching and querying can operate on very different trees (some tree nodes may be complex objects), both rely on six helper functions. These are passed to the match_tree and query functions as keyword arguments:

  • target_tree_nodevalue: extract the target tree node's value
  • query_tree_nodevalue: extract the query tree node's value
  • capture_function: extract captured values from matched target nodes
  • node_comparison_yields_true: make two nodes always match; this is useful when one wants to skip node comparison, e.g. for capture nodes or explicitly ignored nodes
  • is_capture_node: check whether a node is a capture node or not
  • node_equality_function: compares the values of target and query nodes

With these functions the matching function becomes generic and can match arbitrarily complex trees: node values are extracted from nodes, the equality function is applied to the extracted values, and custom capture symbols and wildcards are handled through the condition for skipping node comparison. Finally, the custom capture function is applied to the nodes of matching target sub-trees.
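
How such helpers could plug into a generic matcher is sketched below. This is a simplified illustration under stated assumptions, not ParSitter's implementation: trees are nested tuples, `sketch_match` is a hypothetical function, and the `capture_name` keyword is an extra helper invented for this sketch. The remaining keyword names follow the list above.

```julia
# Toy trees are nested tuples: first element = node value, the rest = children.
tuple_head(t) = t isa Tuple ? first(t) : t
tuple_kids(t) = t isa Tuple ? Base.tail(t) : ()

# `sketch_match` and `capture_name` are hypothetical; the other keyword names
# follow the helper list above.
function sketch_match(target, query;
        target_tree_nodevalue = tuple_head,
        query_tree_nodevalue = tuple_head,
        capture_function = tuple_head,
        node_comparison_yields_true = (t, q) -> false,
        is_capture_node = q -> false,
        node_equality_function = isequal,
        capture_name = q -> "")
    captures = Dict{String, Vector{Any}}()
    function walk(t, q)
        # Either the comparison is skipped, or the extracted values must be equal.
        same = node_comparison_yields_true(t, q) ||
            node_equality_function(target_tree_nodevalue(t), query_tree_nodevalue(q))
        same || return false
        if is_capture_node(q)
            push!(get!(captures, capture_name(q), Any[]), capture_function(t))
        end
        # Every query child needs a target child at the same position.
        length(tuple_kids(q)) <= length(tuple_kids(t)) || return false
        return all(walk(tc, qc) for (tc, qc) in zip(tuple_kids(t), tuple_kids(q)))
    end
    return walk(target, query), captures
end

# Usage: integer target, string query, "@v" captures whatever sits at that position.
qhead(q) = string(tuple_head(q))
ok, caps = sketch_match((1, 2, (3, 4)), ("1", "@v", "3");
    target_tree_nodevalue = t -> string(tuple_head(t)),
    query_tree_nodevalue = q -> String(first(split(qhead(q), "@"))),
    is_capture_node = q -> occursin("@", qhead(q)),
    capture_name = q -> String(last(split(qhead(q), "@"))),
    node_comparison_yields_true = (t, q) -> startswith(qhead(q), "@"))
```

Because all tree-specific knowledge lives in the keyword functions, the same walker would work for any node type for which those functions can be written.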

Building trees

Query trees

The library defines a single structure for working with query trees, the tree-query expression, through the ParSitter.TreeQueryExpr object. It is a simple object whose purpose is to adhere to the AbstractTrees.jl interface. Query trees can be constructed from Tuples or NTuples:

julia> using ParSitter, AbstractTrees
julia> tt = (1,2,(3,(4,5,(6,),7,5)));
julia> tq = ParSitter.build_tq_tree(tt)
ParSitter.TreeQueryExpr{Int64}(1, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(6, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(7, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])])])])
julia> print_tree(tq)
1
├─ 2
└─ 3
   └─ 4
      ├─ 5
      ├─ 6
      ├─ 7
      └─ 5

and converted back to Tuples and NTuples:

julia> tt = convert(Tuple, tq)
(1, 2, (3, (4, 5, 6, 7, 5)))

The basic operating principle is that in the tuple, the first value is the root of the tree and the rest are leaves. When nesting tuples, the first value of an enclosed tuple is the root of a sub-tree and the rest become its leaves.

julia> tt = ("root", "L1_leaf1", "L1_leaf2", ("L2_root", "L2_leaf2"))
("root", "L1_leaf1", "L1_leaf2", ("L2_root", "L2_leaf2"))
julia> tq = ParSitter.build_tq_tree(tt)
ParSitter.TreeQueryExpr{String}("root", ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{String}("L1_leaf1", ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{String}("L1_leaf2", ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{String}("L2_root", ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{String}("L2_leaf2", ParSitter.TreeQueryExpr[])])])
julia> print_tree(tq)
"root"
├─ "L1_leaf1"
├─ "L1_leaf2"
└─ "L2_root"
   └─ "L2_leaf2"
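
The tuple convention can be reproduced with a few lines of plain Julia. The sketch below is independent of ParSitter; `ToyNode`, `toy_tree` and `toy_tuple` are hypothetical names, and the conversion rules are assumptions inferred from the examples above.

```julia
# A minimal tree node: a value plus a list of children.
struct ToyNode
    head::Any
    children::Vector{ToyNode}
end

# A bare value becomes a leaf; for a tuple, the first element is the node value
# and each remaining element becomes a child.
toy_tree(x) = ToyNode(x, ToyNode[])
toy_tree(t::Tuple) = ToyNode(first(t), ToyNode[toy_tree(c) for c in Base.tail(t)])

# Inverse: a leaf becomes a bare value, an inner node becomes a tuple.
toy_tuple(n::ToyNode) =
    isempty(n.children) ? n.head : (n.head, map(toy_tuple, n.children)...)

tree = toy_tree((1, 2, (3, (4, 5, (6,), 7, 5))))
round_trip = toy_tuple(tree)
```

Under these rules a one-element tuple such as `(6,)` and the bare value `6` build the same leaf, which would explain why the round trip shown earlier returns `6` rather than `(6,)`.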

Code trees

Code is represented as EzXML.Node objects. Therefore code querying will resort to matching ::TreeQueryExpr-based trees with ::EzXML.Node-based trees. This is because under the hood, ParSitter relies on tree-sitter to parse code through the following sequence of operations:

  • shell out from Julia and run tree-sitter on either code i.e. a string, file content or directory
  • tree-sitter parses the code and outputs an XML string that contains the code AST
  • the XML AST content is read back into Julia
  • EzXML parses the XML and outputs an EzXML.Node object.

In order to parse code, files and directories, one needs to wrap either the code string, the file path or the directory path into ParSitter.Code, ParSitter.File and ParSitter.Directory objects respectively.

julia> code = ParSitter.Code("def hello(): pass")
ParSitter.Code("def hello(): pass")
julia> result = parse(code, "python")
ParSitter.ParseResult(nothing, "<?xml version=\"1.0\"?><module srow=\"0\" scol=\"0\" erow=\"0\" ecol=\"17\"> <function_definition srow=\"0\" scol=\"0\" erow=\"0\" ecol=\"17\"> def <identifier field=\"name\" srow=\"0\" scol=\"4\" erow=\"0\" ecol=\"9\">hello</identifier> <parameters field=\"parameters\" srow=\"0\" scol=\"9\" erow=\"0\" ecol=\"11\"> ( ) </parameters> : <block field=\"body\" srow=\"0\" scol=\"13\" erow=\"0\" ecol=\"17\"> <pass_statement srow=\"0\" scol=\"13\" erow=\"0\" ecol=\"17\"> pass </pass_statement> </block> </function_definition></module>")
julia> ct = build_xml_tree(result)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x000000002608bf90>))
julia> print_tree(ct.root)
("module", "0xa02dfdb38907542a", (row = "0:0", col = "(0:17)"), " def hello ( ) : pas...")
└─ ("function_definition", "0xf262f92ef3d58282", (row = "0:0", col = "(0:17)"), " def hello ( ) : pas...")
   ├─ ("identifier", "0x9bd4b8de7fa7d9da", (row = "0:0", col = "(4:9)"), "hello")
   ├─ ("parameters", "0x1e9b2077783159a3", (row = "0:0", col = "(9:11)"), " ( ) ")
   └─ ("block", "0xc85aefe8d4ba9f48", (row = "0:0", col = "(13:17)"), " pass ")
      └─ ("pass_statement", "0xf071c362374ae42", (row = "0:0", col = "(13:17)"), " pass ")

Matching trees

Basics

As previously mentioned, matching trees requires functions that, applied to a tree node, extract its value for the purpose of comparison. A capture function must also be provided to extract a specific value from a matched node. These functions are necessary because target and query trees may contain complex nodes that are objects themselves and may need processing before matching and value capture can occur. To match values and at the same time skip comparisons when capturing values, the node_comparison_yields_true argument needs to be specified. Its value should be a function that takes two nodes and returns true if the value needs to be captured.

julia> # Define the helper functions
       _capture_function(n) = "captured_value=" * string(n.head);
julia> _query_tree_nodevalue(n) = ParSitter.is_capture_node(n).is_match ? split(n.head, "@")[1] : n.head;
julia> _target_tree_nodevalue(n) = string(n.head);
julia> _when_to_yield_true(t1,t2) = ParSitter.is_capture_node(t2).is_match && isempty(_query_tree_nodevalue(t2));
julia> tt = ParSitter.build_tq_tree((1,2,(3,(4,)))); # a target tree
julia> tq = ParSitter.build_tq_tree(("1","@v","3")); # a query tree, capture to 'v'
julia> print_tree(tt)
1
├─ 2
└─ 3
   └─ 4
julia> print_tree(tq)
"1"
├─ "@v"
└─ "3"
julia> ParSitter.match_tree(
           tt,
           tq;
           capture_function = _capture_function,
           target_tree_nodevalue = _target_tree_nodevalue,
           query_tree_nodevalue = _query_tree_nodevalue,
           node_comparison_yields_true = _when_to_yield_true)
(true, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v" => ["captured_value=2"])), ParSitter.TreeQueryExpr{Int64}(1, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[])])]))

A full example which matches numerical trees to string queries:

julia> my_matcher(t,q) = ParSitter.match_tree(
                              ParSitter.build_tq_tree(t),
                              ParSitter.build_tq_tree(q);
                              target_tree_nodevalue = _target_tree_nodevalue,
                              query_tree_nodevalue = _query_tree_nodevalue,
                              capture_function = _capture_function,
                              node_comparison_yields_true = _when_to_yield_true);
julia> query = ("1@v0", "2", "@v2") # - query means: capture in "v0" if target value is 1, match on 2, capture any symbol in "v2"
("1@v0", "2", "@v2")
julia> t=(1,2,10); my_matcher( t, query)[1:2] |> println
(true, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=10"], "v0" => ["captured_value=1"])))
julia> t=(10,2,11); my_matcher( t, query)[1:2] |> println
(false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=11"])))
julia> t=(1,2,3,4,5); my_matcher( t, query)[1:2] |> println
(true, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=3"], "v0" => ["captured_value=1"])))
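
The query values used above follow a "value@name" convention. A minimal sketch of how such a string decomposes is given below; `parse_query_value` is a hypothetical helper written for illustration, not a ParSitter function.

```julia
# Split a query value of the form "value@name" into its two parts.
# `parse_query_value` is a hypothetical helper, not part of ParSitter.
function parse_query_value(s::AbstractString; capture_sym = "@")
    occursin(capture_sym, s) || return (value = String(s), capture = nothing)
    value, name = split(s, capture_sym; limit = 2)
    return (value = String(value), capture = String(name))
end

conditional = parse_query_value("1@v0")  # capture into "v0" only if the value is "1"
plain = parse_query_value("2")           # no capture, the value must equal "2"
unconditional = parse_query_value("@v2") # empty value: capture anything into "v2"
```

An empty value part is what the `_when_to_yield_true` helper above checks for in order to skip the comparison and always capture.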

Querying trees

Tree queries match the query tree to the target tree and all its sub-trees.

julia> query = ("1@v0", "2", "@v2");   # - query means: capture in "v0" if target value is 1, match on 2, capture any symbol in "v2"
julia> target = (1, 2, 3, (10, 2, 3)); # - only the (1,2,3) sub-tree will match, the second will not because of the 10;
julia> # - @v2 will always capture values (due to `_when_to_yield_true`)
       query_tq = ParSitter.build_tq_tree(query);
julia> target_tq = ParSitter.build_tq_tree(target);
julia> print_tree(target_tq);
1
├─ 2
├─ 3
└─ 10
   ├─ 2
   └─ 3
julia> print_tree(query_tq);
"1@v0"
├─ "2"
└─ "@v2"

The :strict query mode matches query nodes to target tree nodes exactly, i.e. order counts as well as values.

julia> r=ParSitter.query(target_tq,
                         query_tq;
                         match_type = :strict,
                         target_tree_nodevalue = _target_tree_nodevalue,
                         query_tree_nodevalue = _query_tree_nodevalue,
                         capture_function = _capture_function,
                         node_comparison_yields_true = _when_to_yield_true);
julia> map(t->t[1:2], r)
6-element Vector{Tuple{Bool, DataStructures.MultiDict{Any, Any}}}:
 (1, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=3"], "v0" => ["captured_value=1"])))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=3"])))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))

The :nonstrict query mode will match all nodes if possible.

julia> r=ParSitter.query(target_tq,
                         query_tq;
                         match_type = :nonstrict,
                         target_tree_nodevalue = _target_tree_nodevalue,
                         query_tree_nodevalue = _query_tree_nodevalue,
                         capture_function = _capture_function,
                         node_comparison_yields_true = _when_to_yield_true);
julia> map(t->t[1:2], r)
6-element Vector{Tuple{Bool, DataStructures.MultiDict{Any, Any}}}:
 (1, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=3", "captured_value=10"], "v0" => ["captured_value=1"])))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v2" => ["captured_value=3"])))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))
 (0, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()))

Note

The :nonstrict matching mode may return multiple captured values for a specific named capture; however, it is not possible to trace a capture back to the sub-tree it belongs to. This means that the other matches associated with a given captured value cannot be retrieved.

The :speculative match mode

This feature is only available in v0.2.0

The :speculative matching mode is faster than :nonstrict because it stops after the first sub-tree match at each level during the recursive search. The result is that it will return a single value for each named capture even if more could be retrieved.

julia> _when_to_yield_true(tt, qt) =
           (
           ParSitter.is_capture_node(qt; capture_sym = "@").is_match
               && isempty(_query_tree_nodevalue(qt))
       ) ||
           _query_tree_nodevalue(qt) == "*"
_when_to_yield_true (generic function with 1 method)
julia> target = ParSitter.build_tq_tree((1, 2, (3, (4, 5)), (-3, -4)))
ParSitter.TreeQueryExpr{Int64}(1, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])])]), ParSitter.TreeQueryExpr{Int64}(-3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[])])])
julia> query = ParSitter.build_tq_tree(("*", ("*", "*"), ("*", ("@v"))))
ParSitter.TreeQueryExpr{String}("*", ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{String}("*", ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{String}("*", ParSitter.TreeQueryExpr[])]), ParSitter.TreeQueryExpr{String}("*", ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{String}("@v", ParSitter.TreeQueryExpr[])])])
julia> results = ParSitter.query(
           target,
           query;
           match_type = :speculative,
           target_tree_nodevalue = _target_tree_nodevalue,
           query_tree_nodevalue = _query_tree_nodevalue,
           capture_function = _capture_function,
           node_comparison_yields_true = _when_to_yield_true)
7-element Vector{Any}:
 (true, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v" => ["captured_value=-4"])), ParSitter.TreeQueryExpr{Int64}(1, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])])]), ParSitter.TreeQueryExpr{Int64}(-3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[])])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(-3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[]))
julia> get_capture(results, "v")
1-element Vector{Any}:
 "captured_value=-4"

In contrast, :nonstrict will return more captured values since two sub-trees match:

julia> results = ParSitter.query(
           target,
           query;
           match_type = :nonstrict,
           target_tree_nodevalue = _target_tree_nodevalue,
           query_tree_nodevalue = _query_tree_nodevalue,
           capture_function = _capture_function,
           node_comparison_yields_true = _when_to_yield_true)
7-element Vector{Any}:
 (true, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("v" => ["captured_value=-4", "captured_value=4"])), ParSitter.TreeQueryExpr{Int64}(1, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])])]), ParSitter.TreeQueryExpr{Int64}(-3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[])])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(2, ParSitter.TreeQueryExpr[]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(4, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(5, ParSitter.TreeQueryExpr[]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(-3, ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[])]))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), ParSitter.TreeQueryExpr{Int64}(-4, ParSitter.TreeQueryExpr[]))
julia> get_capture(results, "v")
ERROR: Ambigous capture, more than 1 match for key "v"
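
When several values are captured under one key, they can still be read directly from the result tuples. The sketch below shows one way to gather them; it runs on stand-in data in which a plain `Dict` of vectors replaces `DataStructures.MultiDict`, and `all_captures` is a hypothetical helper. With real query results the lookup call may need adjusting.

```julia
# Stand-in for query results: (is_match, captures, node) triples, with a plain
# Dict of vectors used here in place of DataStructures.MultiDict.
results = [
    (true, Dict("v" => Any["captured_value=-4", "captured_value=4"]), :root),
    (false, Dict{String, Vector{Any}}(), :leaf),
]

# Hypothetical helper: gather every value captured under `key` from matching results.
function all_captures(results, key)
    collected = Any[]
    for (is_match, captures, _) in results
        is_match || continue
        append!(collected, get(captures, key, Any[]))
    end
    return collected
end

values_for_v = all_captures(results, "v")
```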

More examples of tree-matching behavior can be found in the query tests.

Query DSL

This feature is only available in v0.2.0

A high-level DSL for writing queries as real code snippets with placeholders, aimed at intuitive, language-native querying, is available on top of the low-level S-Tuple based querying. It is based on the idea that querying code should be done with real code snippets. The locations where code is to be captured or ignored are marked with {{}} placeholders. Currently, the following placeholders are supported:

  • {{capture_name::CAPTURE_TYPE}} - named capture (extracts value into capture_name).
  • {{::CAPTURE_TYPE}} - non-capturing placeholder (matches tree structure only).
  • {{some_valid_code}} - Generic code insertion (use custom_replacements argument), non-capturing.

julia> using ParSitter.QueryLanguage, AbstractTrees
julia> code_snippet = """
       def {{func_name::IDENTIFIER}}():
        x = {{::STRING}}
       """
"def {{func_name::IDENTIFIER}}():\n x = {{::STRING}}\n"
julia> query_expr, _ = parse_code_snippet_to_query(code_snippet, "python")
(ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "module"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "function_definition"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("@func_name", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "parameters"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "block"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "assignment"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("x", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "string"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "string_start"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "string_content"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "string_end"), ParSitter.TreeQueryExpr[])])])])])]), Dict{Any, Any}("\"_xzAE0MV2UU\"" => ("STRING", false), "func_name_t3aR1P5sWY" => ("IDENTIFIER", true)), "def func_name_t3aR1P5sWY():\n x = \"_xzAE0MV2UU\"\n")
julia> print_tree(query_expr, maxdepth=10)
"*"
└─ "*"
   ├─ "@func_name"
   ├─ "*"
   └─ "*"
      └─ "*"
         ├─ "x"
         └─ "*"
            ├─ "*"
            ├─ "*"
            └─ "*"
julia> # Optional custom replacements for generic placeholders
       query_expr, _ = parse_code_snippet_to_query(
          "x = {{my_expr}}",
          "julia";
          custom_replacements = Dict("my_expr" => "1 + 2")
       )
(ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "source_file"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "assignment"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("x", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "operator"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "binary_expression"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "integer_literal"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "operator"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "integer_literal"), ParSitter.TreeQueryExpr[])])])]), Dict{Any, Any}("1 + 2" => ("GENERIC_CODE", false)), "x = 1 + 2")
julia> print_tree(query_expr)
"*"
└─ "*"
   ├─ "x"
   ├─ "*"
   └─ "*"
      ├─ "*"
      ├─ "*"
      └─ "*"

How query generation from snippet works internally:

  • placeholders are replaced with valid language-specific code (e.g., randomized identifiers or type-specific literals). Supported CAPTURE_TYPEs and their associated code are present in ParSitter.DEFAULT_TYPE_REPLACEMENTS and loaded from the languages/ directory.
  • the snippet is parsed with tree-sitter to an XML AST.
  • the XML AST is parsed to an EzXML tree and converted to ParSitter.TreeQueryExpr where:
    • each node is a ParSitter.TreeQueryNode containing value and type. The types of query nodes will be node types supported by tree-sitter exclusively.
    • captures become @capture_name.
    • Non-captures become wildcards ("*")
    • structure is preserved exactly.
  • The final generated query can be used with match_tree or query.
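
The first step, replacing placeholders with valid code, can be sketched in plain Julia. This is only an illustration of the idea: `TOY_REPLACEMENTS`, `TOY_PLACEHOLDER` and `toy_fill_placeholders` are hypothetical, the replacement table is an assumption and not the content of ParSitter.DEFAULT_TYPE_REPLACEMENTS, and generic `{{code}}` placeholders are not handled.

```julia
using Random

# Hypothetical replacement table: placeholder type => generator of valid code.
const TOY_REPLACEMENTS = Dict(
    "IDENTIFIER" => (name, suffix) -> "$(name)_$(suffix)",
    "STRING" => (name, suffix) -> "\"$(name)_$(suffix)\"",
)

# Matches {{name::TYPE}} and {{::TYPE}}; the name part may be empty.
const TOY_PLACEHOLDER = r"\{\{(\w*)::(\w+)\}\}"

# Replace each placeholder with generated code and record the inserted code
# together with its type and whether it is a capture (non-empty name).
function toy_fill_placeholders(snippet::AbstractString)
    inserted = Dict{String, Tuple{String, Bool}}()
    function fill_one(placeholder)
        name, kind = match(TOY_PLACEHOLDER, placeholder).captures
        code = TOY_REPLACEMENTS[String(kind)](name, randstring(10))
        inserted[code] = (String(kind), !isempty(name))
        return code
    end
    return replace(snippet, TOY_PLACEHOLDER => fill_one), inserted
end

filled, inserted = toy_fill_placeholders("def {{func_name::IDENTIFIER}}(): x = {{::STRING}}")
```

The random suffix keeps generated identifiers from clashing with names already present in the snippet, and the `inserted` table lets a later step turn the generated code back into captures or wildcards.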

Querying code

Below is a minimal example of querying a snippet of code written in R.

julia> using ParSitter, AbstractTrees
julia> _target_nodevalue(n) = strip(replace(n.content, r"[\s]" => ""));
julia> _query_nodevalue(n) = ifelse(ParSitter.is_capture_node(n).is_match, string(split(n.head.value, "@")[1]), n.head.value);
julia> _apply_regex_glob(tn, qn) = ParSitter.is_capture_node(qn; capture_sym = "@").is_match || qn.head.value == "*";
julia> _capture_function(n) = (v = strip(replace(n.content, r"[\s]" => "")), srow = n["srow"], erow = n["erow"], scol = n["scol"], ecol = n["ecol"]);
julia> R_code = ParSitter.Code(
           """
           # a comment
           mod12 <- glmmTMB(y ~ x1 + x2 + x3 + x4 + (0 | x5),
            data = data_variable,
            family = binomial(link = "linear"))
           """
       )
ParSitter.Code("# a comment\nmod12 <- glmmTMB(y ~ x1 + x2 + x3 + x4 + (0 | x5),\n data = data_variable,\n family = binomial(link = \"linear\"))\n")
julia> language = "r"
"r"
julia> _parsed = ParSitter.parse(R_code, language);
julia> target = ParSitter.build_xml_tree(_parsed);
julia> query_snippet = """
        {{comment::COMMENT}}
        {{::IDENTIFIER}} <- glmmTMB({{::R_FORMULA}},
        family ={{family::IDENTIFIER}}({{identifier::IDENTIFIER}}={{id_val::STRING}}))
       """
" {{comment::COMMENT}}\n {{::IDENTIFIER}} <- glmmTMB({{::R_FORMULA}},\n family ={{family::IDENTIFIER}}({{identifier::IDENTIFIER}}={{id_val::STRING}}))\n"
julia> generated_query, _, _ = ParSitter.QueryLanguage.parse_code_snippet_to_query(query_snippet, language)
(ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "program"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("@comment", "comment"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "binary_operator"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "call"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("glmmTMB", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "arguments"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "argument"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "comma"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "argument"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("family", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "call"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("@family", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "arguments"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "argument"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("@identifier", "identifier"), ParSitter.TreeQueryExpr[]), ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("@id_val", "string"), ParSitter.TreeQueryExpr[ParSitter.TreeQueryExpr{ParSitter.TreeQueryNode}(ParSitter.TreeQueryNode("*", "string_content"), ParSitter.TreeQueryExpr[])])])])])])])])])]), Dict{Any, Any}("y~x_VPKfjQj4dD" => ("R_FORMULA", false), "var_81mwoti5ES" => ("IDENTIFIER", false), "identifier_VYrHrfn3Vq" => ("IDENTIFIER", true), "\"id_val_ygVuoH9RnC\"" => ("STRING", true), "family_UobwF8j4yX" => ("IDENTIFIER", true), "#comment_UlEMqVAsQQ" => ("COMMENT", true)), " #comment_UlEMqVAsQQ\n var_81mwoti5ES <- glmmTMB(y~x_VPKfjQj4dD,\n family =family_UobwF8j4yX(identifier_VYrHrfn3Vq=\"id_val_ygVuoH9RnC\"))\n")
julia> print_tree(generated_query, maxdepth=10)
"*"
├─ "@comment"
└─ "*"
   ├─ "*"
   └─ "*"
      ├─ "glmmTMB"
      └─ "*"
         ├─ "*"
         ├─ "*"
         └─ "*"
            ├─ "family"
            └─ "*"
               ├─ "@family"
               └─ "*"
                  └─ "*"
                     ├─ "@identifier"
                     └─ "@id_val"
                        └─ "*"
julia> query_results = ParSitter.query(
           target.root,
           generated_query;
           match_type = :speculative,
           target_tree_nodevalue = _target_nodevalue,
           query_tree_nodevalue = _query_nodevalue,
           capture_function = _capture_function,
           node_comparison_yields_true = _apply_regex_glob
       )
36-element Vector{Any}:
 (true, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("family" => [(v = "binomial", srow = "3", erow = "3", scol = "26", ecol = "34")], "id_val" => [(v = "\"linear\"", srow = "3", erow = "3", scol = "42", ecol = "50")], "identifier" => [(v = "link", srow = "3", erow = "3", scol = "35", ecol = "39")], "comment" => [(v = "#acomment", srow = "0", erow = "0", scol = "0", ecol = "11")])), EzXML.Node(<ELEMENT_NODE[program]@0x00000000271dec50>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[comment]@0x00000000298f6890>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[binary_operator]@0x0000000025d2e150>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[identifier]@0x0000000027be6950>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[call]@0x0000000026eaa590>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[identifier]@0x0000000026e16190>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[arguments]@0x0000000027e14310>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[argument]@0x0000000027d06410>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[binary_operator]@0x000000002a030d90>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[identifier]@0x0000000028cc0990>))
 ⋮
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[argument]@0x0000000026a6ce90>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[identifier]@0x00000000278f46b0>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[call]@0x0000000027736250>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[identifier]@0x000000002724c5d0>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[arguments]@0x0000000027507010>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[argument]@0x0000000028371410>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[identifier]@0x0000000026e13d50>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[string]@0x0000000026927aa0>))
 (false, DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}()), EzXML.Node(<ELEMENT_NODE[string_content]@0x0000000027de3810>))
julia> filter!(first, query_results); # keep only matches
julia> println(query_results[1][2])
DataStructures.MultiDict{Any, Any}(Dict{Any, Vector{Any}}("family" => [(v = "binomial", srow = "3", erow = "3", scol = "26", ecol = "34")], "id_val" => [(v = "\"linear\"", srow = "3", erow = "3", scol = "42", ecol = "50")], "identifier" => [(v = "link", srow = "3", erow = "3", scol = "35", ecol = "39")], "comment" => [(v = "#acomment", srow = "0", erow = "0", scol = "0", ecol = "11")]))
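
The captured positions are returned as strings. A small post-processing step, sketched below on one captured value copied from the output above, converts them to integers; `to_span` is a hypothetical helper and not part of ParSitter.

```julia
# One captured value, shaped like the named tuples returned in the example above.
capture = (v = "binomial", srow = "3", erow = "3", scol = "26", ecol = "34")

# Hypothetical helper: parse the string positions into integers.
to_span(c) = (text = c.v,
              srow = parse(Int, c.srow), erow = parse(Int, c.erow),
              scol = parse(Int, c.scol), ecol = parse(Int, c.ecol))

span = to_span(capture)

# For this single-line capture the column difference equals the original text length.
width = span.ecol - span.scol
```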

More examples of tree-matching behavior can be found in the query language tests.

CLI-based parsing

ParSitter.jl comes with a CLI tool that allows easy parsing of inline code, files and directories. Currently, it supports the following languages: Python, Julia, C, C# and R. This can be extended by adding more language files in languages/.

Installing tree-sitter languages

In order to parse code, tree-sitter and the parsers for specific languages need to be installed. For example, to install the Python language parser into a directory named _parsers, located in the current directory, the following sequence of commands should do it:

cd _parsers
git clone https://github.com/tree-sitter/tree-sitter-python
cd tree-sitter-python
tree-sitter generate

Running the CLI tool

When run, the tool returns a JSON string of the form:

{ "path/to/file":"parsed code in XML format",
  ...
}

For directories the JSON will contain more key-value pairs and for inline code the file path key is an empty string. For example, the following command

julia --project parsitter.jl ./test/code/python/test_project/main.py --input-type file --language python --log-level error

will result in

{".../ParSitter.jl/test/code/python/test_project/main.py":"<?xml version=\"1.0\"?><module srow=\"0\" scol=\"0\" erow=\"15\" ecol=\"0\">  <import_from_sta...
}

Note that the --escape-chars option should be used when parsing inline code containing \n, \t or \r characters.

For example, the following works:

$ julia parsitter.jl 'def foo():pass' --input-type code --language python --log-level debug

However, if escape characters are present, use the --escape-chars option:

$ julia parsitter.jl 'def foo():\n\tpass' --input-type code --escape-chars --language python --log-level debug
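
The reason the option is needed is that a single-quoted shell argument reaches Julia with a literal backslash followed by a letter, not with a real newline or tab. The sketch below shows the difference using Base's `unescape_string`; whether the CLI tool uses this exact function internally is an assumption.

```julia
# What the program receives from the shell: backslash + 'n' and backslash + 't'.
raw = "def foo():\\n\\tpass"

# Converting the escape sequences yields real newline and tab characters.
code = unescape_string(raw)

print(code)
```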