Parslet
I would like to introduce you to the parslet gem. Parslet is a small library for “constructing parsers in PEG (Parsing Expression Grammar) fashion”. What follows is a tutorial for this powerful library, built around an actual example of how I used it in a production system.
The idea
Imagine that you have a web application that stores data about books, and that you want to use Elasticsearch as its search engine. You want to give users the ability to search the books by title and year of publication. To do that, the JS code on the client side compiles a query String from the filters defined by the user. This query String is sent to the server, where it is converted to a Ruby Hash. The Hash is then consumed by the Ruby client for Elasticsearch.
The query format
We need to decide what the query String will look like. I would do it like so: the field name is the part before the ":" character, and the searched value for the field comes after it.
To separate multiple subqueries we will use ",,". For date ranges let’s use "__" as the range boundary.
This table describes what we want:
Query string | Description |
---|---|
title:foo | Search for books with “foo” in the title |
year:1954 | Search for books published in 1954 |
year:__1956 | Search for books published before or in 1956 |
year:1954__ | Search for books published after or in 1954 |
year:1954__1956 | Search for books published between 1954 and 1956 |
title:foo,,year:__1956 | Search for books with “foo” in the title and published before or in 1956 |
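In the application this String is assembled by JS on the client. Purely to illustrate the format, here is a rough Ruby sketch of the same idea. The helper build_query_string and its keyword arguments are hypothetical names made up for this example; they are not part of the original application.

```ruby
# Hypothetical helper (not from the original app): builds a query String
# in the format described by the table above.
def build_query_string(title: nil, year: nil, year_from: nil, year_to: nil)
  parts = []
  parts << "title:#{title}" if title
  if year
    parts << "year:#{year}"
  elsif year_from || year_to
    # "__" marks the range boundary; a missing bound is simply left empty.
    parts << "year:#{year_from}__#{year_to}"
  end
  # ",," separates the subqueries.
  parts.join(",,")
end

build_query_string(title: "foo")                 # => "title:foo"
build_query_string(year_to: 1956)                # => "year:__1956"
build_query_string(year_from: 1954)              # => "year:1954__"
build_query_string(title: "foo", year_to: 1956)  # => "title:foo,,year:__1956"
```

Note that the sketch does no escaping, so a title containing ",," would break the format.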
Elasticsearch
The Ruby client for Elasticsearch has the method search, which accepts a Hash that represents the DSL used by the engine.
For example:
require 'elasticsearch'
client = Elasticsearch::Client.new log: true
client.search body: {query: {match: {title: "foo"}}} # will give books with "foo" in the title
The conversion
The table below gives examples of how the conversions should look:
Query string | Ruby Hash |
---|---|
title:foo | {query: {match: {title: "foo"}}} |
year:1849 | {filtered: {filter: {term: {year: "1849"}}}} |
year:__1849 | {filtered: {filter: {range: {year: {lte: "1849"}}}}} |
year:1942__ | {filtered: {filter: {range: {year: {gte: "1942"}}}}} |
year:1954__1956 | {filtered: {filter: {range: {year: {gte: "1954", lte: "1956"}}}}} |
title:foo,,year:__1956 | {filtered: {query: {match: {'title' => "foo"}}, filter: {range: {'year' => {lte: "1956"}}}}} |
How do we implement that? My proposal is to create a parser that will consume the query String and generate the query Hash for Elasticsearch. That’s where parslet comes into play.
Parslet
Parslet allows you to write a parser that will take the input String (our query) and create a syntax tree according to the rules specified by the programmer. “fields”, “values”, “ranges” will be leaves of the tree that we can use to generate the necessary Hash.
Parsing the input
In order to create a parser one needs to create a class that inherits from Parslet::Parser.
Simple parser
Let’s start with something simple:
require 'parslet'
class Parser < Parslet::Parser
  rule(:match_field) do
    str('title')
  end
  rule(:filter_field) do
    str('year')
  end
  rule(:value) do
    any.repeat
  end
  rule(:subquery) do
    (match_field.as(:match_field) | filter_field.as(:filter_field)) >> str(':') >> value.as(:value)
  end
  root(:subquery)
end
I’ll explain what happened here from top to bottom. First we’ve called rule(:match_field) and rule(:filter_field): we’ve defined rules for parsing match_field and filter_field respectively.
If we try them in the console:
match_rule = Parser.new.match_field
match_rule.parse("title") # => "title"@0
filter_rule = Parser.new.filter_field
filter_rule.parse("year") # => "year"@0
filter_rule.parse("author") # => raises Parslet::ParseFailed
match_field is simply the String “title”, whereas filter_field is the String “year”.
Let’s continue with the value rule:
value_rule = Parser.new.value
value_rule.parse("a value!") # => "a value!"@0
value_rule.parse(" another value!") # => " another value!"@0
Basically value can be any String. Since repeat without arguments means “zero or more”, even an empty String matches.
Now for the subquery rule:
subquery_rule = Parser.new.subquery
subquery_rule.parse("title:foo") # => {:match_field=>"title"@0, :value=>"foo"@6}
subquery_rule.parse("year:1954") # => {:filter_field=>"year"@0, :value=>"1954"@5}
subquery_rule.parse("author:1954") # => raises Parslet::ParseFailed
For this rule we’ve used the alternative operator | and the >> operator, which chains atoms into a sequence. So a subquery is either match_field or filter_field, followed by the String ":" and a value.
As you might have guessed by now, the method as is used to name the leaves of the resulting tree.
The root method specifies the main rule, which the parser uses when it starts consuming the String:
parser = Parser.new
parser.parse("title:foo") # => {:match_field=>"title"@0, :value=>"foo"@6}
Parsing ranges
Now we’ll want to extend our parser to be able to capture the date ranges:
class Parser < Parslet::Parser
  def year(name)
    match('[0-9]').repeat(4).as(name)
  end
  rule(:match_field) do
    str('title')
  end
  rule(:filter_field) do
    str('year')
  end
  rule(:value) do
    any.repeat
  end
  rule(:lte) do
    str('__') >> year(:lte)
  end
  rule(:gte) do
    year(:gte) >> str("__")
  end
  rule(:between) do
    year(:gte) >> str('__') >> year(:lte)
  end
  rule(:range) do
    # alternatives are tried in order, so "between" must come before "gte";
    # otherwise "1954__1956" would match "gte" and leave "1956" unconsumed
    between | lte | gte
  end
  rule(:subquery) do
    (match_field.as(:match_field) | filter_field.as(:filter_field)) >>
      str(':') >>
      (range.as(:range) | value.as(:value))
  end
  root(:subquery)
end
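Note that the match method accepts a String, not a Regexp. Also, match captures a single character only. So this works: match('[0-9]').repeat(4), but this doesn’t: match('[0-9]{4}'). Keep in mind that repeat(4) means “four or more” repetitions; to accept exactly four digits you would write repeat(4, 4).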
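The order of the alternatives in the range rule matters. In a PEG the | operator is an ordered choice: alternatives are tried from left to right and the first one that matches wins, even if a later one would have consumed more input. The sketch below shows that behaviour in plain Ruby. It is not Parslet's implementation; str, seq and alt are hypothetical helpers written only for this illustration.

```ruby
# A plain-Ruby sketch of PEG-style matching (NOT how Parslet is implemented).
# Each matcher is a lambda taking (input, pos) and returning the new
# position on success or nil on failure.
def str(s)
  ->(input, pos) { input[pos, s.size] == s ? pos + s.size : nil }
end

# Sequence: every matcher must succeed, each starting where the previous ended.
def seq(*matchers)
  ->(input, pos) { matchers.reduce(pos) { |p, m| p && m.call(input, p) } }
end

# Ordered choice: the result of the FIRST alternative that matches.
def alt(*matchers)
  lambda do |input, pos|
    result = nil
    matchers.each do |m|
      result = m.call(input, pos)
      break if result
    end
    result
  end
end

# Exactly four digits.
year    = ->(input, pos) { input[pos, 4].to_s.match?(/\A[0-9]{4}\z/) ? pos + 4 : nil }
lte     = seq(str("__"), year)
gte     = seq(year, str("__"))
between = seq(year, str("__"), year)

input = "1954__1956"
gte_first     = alt(lte, gte, between).call(input, 0)
between_first = alt(between, lte, gte).call(input, 0)
gte_first      # => 6, "gte" wins and "1956" is left unconsumed
between_first  # => 10, the whole input is consumed
```

With gte tried before between, the range stops after "1954__" and the rest of the input is left over, which a parser that must consume the whole String reports as a failure.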
Multiple filters
The last thing to do is to be able to specify many subqueries at once:
class Parser < Parslet::Parser
  def year(name)
    match('[0-9]').repeat(4).as(name)
  end
  rule(:match_field) do
    str('title')
  end
  rule(:filter_field) do
    str('year')
  end
  rule(:value) do
    (str(',,').absent? >> any).repeat
  end
  rule(:lte) do
    str('__') >> year(:lte)
  end
  rule(:gte) do
    year(:gte) >> str("__")
  end
  rule(:between) do
    year(:gte) >> str("__") >> year(:lte)
  end
  rule(:range) do
    between | lte | gte
  end
  rule(:subquery) do
    (match_field.as(:match_field) | filter_field.as(:filter_field)) >>
      str(':') >>
      (range.as(:range) | value.as(:value))
  end
  rule(:subqueries) do
    (subquery >> (str(',,') >> subquery).repeat(0)).repeat(1).as(:subqueries)
  end
  root(:subqueries)
end
We’ve changed the root to be the new rule subqueries. subqueries can be a single subquery or multiple occurrences of a subquery separated by the String ",,". Sweet!
The other notable change is inside the value rule. The method any captures any character, including "," (the character our subquery delimiter is made of). To keep value from swallowing the delimiter we use absent?, which is a negative lookahead: it succeeds only when the given atom (here str(',,')) does not match at the current position, and it never consumes any input. In order to grok it, let’s think about how the parser consumes the input title:f,,year:1954:
title:f,,year:1954
    ^
    |
    -- Up to here the parser has figured out that this is the "match_field" rule.
title:f,,year:1954
      ^
      |
      -- After the ":" it starts matching the "value" rule. The upcoming input "f,"
         does not start with ",,", so "absent?" succeeds and "any" consumes "f".
title:f,,year:1954
       ^
       |
       -- The upcoming input ",,year" DOES start with ",,", so "absent?" fails and the
          repetition stops: the parser has just matched the rule "value".
We've used the "absent?" method, so the String ",," is not captured (only "f").
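The same "look ahead, then consume one character" loop can be written by hand with StringScanner from Ruby's standard library. This is only a sketch of the idea behind (str(',,').absent? >> any).repeat, not Parslet code; scan_value is a hypothetical helper name.

```ruby
require 'strscan'

# Hand-rolled equivalent of (str(',,').absent? >> any).repeat (not Parslet).
def scan_value(scanner)
  value = +""
  # check(/,,/) looks ahead at the current position without moving the
  # scan pointer, which is what makes it comparable to absent?.
  until scanner.eos? || scanner.check(/,,/)
    value << scanner.getch # like "any": consume exactly one character
  end
  value
end

scanner = StringScanner.new("title:f,,year:1954")
scanner.skip(/title:/)      # move past the field name and the ":"
value = scan_value(scanner)
value         # => "f"
scanner.rest  # => ",,year:1954"
```

The delimiter is still in the input after the value has been read, so whatever handles the list of subqueries can consume it next.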
Transforming the tree into Elasticsearch query
Now that we have split our query String into a syntax tree we need to transform it to a Hash consumable by the Elasticsearch client.
Parslet has a Transform class which allows the programmer to define deep Hash transformations. What we need is a Transform like this:
class Transform < Parslet::Transform
  rule(filter_field: simple(:filter_field), range: { lte: simple(:lte) }) do
    {
      filter: {
        range: {
          filter_field => { lte: lte.to_s }
        }
      }
    }
  end
  rule(filter_field: simple(:filter_field), range: { gte: simple(:gte) }) do
    {
      filter: {
        range: {
          filter_field => { gte: gte.to_s }
        }
      }
    }
  end
  rule(filter_field: simple(:filter_field), range: { lte: simple(:lte), gte: simple(:gte) }) do
    {
      filter: {
        range: {
          filter_field => { lte: lte.to_s, gte: gte.to_s }
        }
      }
    }
  end
  rule(filter_field: simple(:filter_field), value: simple(:value)) do
    {
      filter: {
        term: {
          filter_field => value
        }
      }
    }
  end
  rule(match_field: simple(:match_field), value: simple(:value)) do
    { match: { match_field => value } }
  end
  rule(subqueries: subtree(:subqueries)) do |dict|
    # dict[:subqueries] is already transformed using the rules defined above
    dict = dict[:subqueries]
    output = {
      filtered: {
        # look if there's a `match` rule, if not include the `match_all` clause
        query: dict.detect(-> { { match_all: {} } }) { |d| d[:match] }
      }
    }
    filters = dict.map { |d| d[:filter] }.compact
    # if any filters are present merge them under the `filtered` key
    output[:filtered].merge!(filter: { and: filters }) if filters.any?
    output
  end
end
The first argument for the rule method is a Hash to match against the product of the parser. The return value of the passed block is what will replace the matched Hash.
For example:
transform = Transform.new
transform.apply({match_field: "title", value: "foo"}) # => {:match=>{"title"=>"foo"}}
Other methods of Parslet::Transform used here are simple and subtree. simple roughly matches a value of a Hash which is not an Enumerable. subtree is a placeholder for tree transformation patterns that will match any kind of subtree.
The block of the rule that uses subtree accepts an argument (dict) which holds the input as transformed up to this point. In that block we construct the final Hash.
See for yourself:
transform = Transform.new
transform.apply(subqueries: [{match_field: "title", value: "foo"}]) # => {:filtered=>{:query=>{:match=>{"title"=>"foo"}}}}
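The subqueries rule leans on a less known feature of Enumerable#detect: when no element satisfies the block, detect calls the object passed as its argument and returns the result. Here is that step in isolation, in plain Ruby without Parslet. The transformed Array is made up for the illustration, with ordinary Strings standing in for the slices the parser would produce.

```ruby
# Already-transformed subqueries, roughly as the rules above would leave them.
# Plain Strings stand in for Parslet slices here.
transformed = [
  { match: { "title" => "foo" } },
  { filter: { range: { "year" => { lte: "1956" } } } }
]

# Fallback used when no subquery carries a :match key.
match_all = -> { { match_all: {} } }

# detect returns the first element for which the block is truthy...
with_match = transformed.detect(match_all) { |d| d[:match] }
# ...and calls the fallback when nothing matches.
without_match = transformed.drop(1).detect(match_all) { |d| d[:match] }

# Every subquery that produced a :filter contributes to the filter list.
filters = transformed.map { |d| d[:filter] }.compact

with_match     # => { match: { "title" => "foo" } }
without_match  # => { match_all: {} }
filters        # => [{ range: { "year" => { lte: "1956" } } }]
```

Note that detect returns the whole matching element, not the value of the block, which is why the :match key is still present under query in the final Hash.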
Putting it all together
Now that we have all the pieces sorted out we can use them to query Elasticsearch.
First create some books. From your console:
curl -XPUT "http://localhost:9200/books/book/1" -d'
{
"title": "The Godfather",
"year": 1969
}'
curl -XPUT "http://localhost:9200/books/book/2" -d'
{
"title": "The Count of Monte Cristo",
"year": 1844
}'
Now from irb:
require 'elasticsearch'
require 'parslet'
input = "title:godfather,,year:__1970"
tree = Parser.new.parse(input) # => {:subqueries=>[{:match_field=>"title"@0, :value=>"godfather"@6}, {:filter_field=>"year"@17, :range=>{:lte=>"1970"@24}}]}
query = Transform.new.apply(tree) # => {:filtered=>{:query=>{:match=>{"title"@0=>"godfather"@6}}, :filter=>{:and=>[{:range=>{"year"@17=>{:lte=>"1970"}}}]}}}
client = Elasticsearch::Client.new
client.search(index: "books", body: {query: query}) # => {"took"=>18, "timed_out"=>false, "_shards"=>{"total"=>5, "successful"=>5, "failed"=>0}, "hits"=>{"total"=>1, "max_score"=>0.19178301, "hits"=>[{"_index"=>"books", "_type"=>"book", "_id"=>"1", "_score"=>0.19178301, "_source"=>{"title"=>"The Godfather", "year"=>1969}}]}}
The returned Hash has the "hits" key, which stores what we were looking for.
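To get from that Hash to the books themselves, something along these lines should do. The response below is the one returned by the search above, trimmed to the keys that matter; the variable names are only illustrative.

```ruby
# The response Hash from the search above, trimmed to the relevant keys.
response = {
  "hits" => {
    "total" => 1,
    "hits" => [
      { "_id" => "1", "_score" => 0.19178301,
        "_source" => { "title" => "The Godfather", "year" => 1969 } }
    ]
  }
}

# Each hit keeps the indexed document under "_source".
books  = response.dig("hits", "hits").map { |hit| hit["_source"] }
titles = books.map { |book| book["title"] }
titles # => ["The Godfather"]
```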
Conclusion
The example I’ve shown here is pretty simple. It could be replaced with an approach that uses Regexp. But as soon as we start adding more fields and operators to our grammar, the advantage of using parslet becomes apparent. Furthermore, I found writing my own parser a lot more pleasant.
The code that uses this approach for querying Elasticsearch server can be found here.