Implementing a Text-File Parser with nom
This is a cleaned up overview of what I did to implement at text based parser with Nom, which should serve as maybe 35% tutorial and 90% Live-journal, train of thought rambling.
Nom is a parser combinator based parser library in Rust. Other parser libraries in Rust are (pest)[https://github.com/pest-parser/pest] which is PEG based, and (LALRPOP)[https://github.com/lalrpop/lalrpop] which generates LR(1) parsers.
###Parser Combinators Parsing is an unsolved problem, there’s no right answer like ISO 8609 for dates or serializable isolation for DBs. (Here is a short overview of the challenges)[https://tratt.net/laurie/blog/entries/parsing_the_solved_problem_that_isnt.html] .
Essentially there are competing ways of describing your language, and many theoretical trade offs to make. No version is perfect.
Parser combinators are a version whos pros are
- a) easy to read
- b) easy to build up from bottom up
and the main con people bring up is that they are equivalent in power to LLR(1) parsers, and therefore have the same ambiguity/readability issues. The one thing that I like about nom/combinators, is how testable they are! Combined with Rusts’ inline testing it makes it really easy to build up.
###About Nom Nom has changed a lot over time and you may find outdated documentations describing old macros or older ways of doing things. Mainly, nom implements the Combinator-ing through macros that can take any number of arguments. This sometimes makes them hard to debug when something happens.
The other thing to know about nom is its IResult type. The parsers that you write and combinate take your input (&str or &bytes) and all
return a IResult
type, which is equivalent to a Result<(Input, Output), nomError>
type (and you can (choose)[https://github.com/Geal/nom/blob/master/doc/error_management.md]
the error type later if you want). The Input in the Result carries the as-of-yet unparsed string as you apply a parser.
So applying a parser for “Hello” to “Hello World” would return a Ok((" World", "Hello"))
. We’ll see more of this later.
###The format So, what we’ll be parsing are TSPLIB problem definitions, which mostly look like this:
NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION: 52
EDGE_WEIGHT_TYPE: EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
... (edges omitted for brevity)
EOF
The spec is (here)[https://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp95.pdf]. The reason for the Spec is that there are a lot of other types of TSP-like problems (eg asymmetric TSP) and different ways to state the problem. Examples are solving a “all cities are connected by their Euclidean distance” problem vs a “take these particular edges with these weights” problem, and an option to describe adjacency lists vs a full adjacency matrix.
Here is a simple parser:
named!(name<&str, &str>,
tag_s!("NAME:")
);
Will make a function called name
, which will parse the atom “NAME:”, for our problems’ name.
The type signature is <input_type, output_type>
. The input type can be &str or &bytes.
There is an alternative way of writing parsers you may see here and throughout the literature, using functions instead of the named!
macro.
This is what that functions looks like:
fn name(input: &str) -> IResult<&str, &str> {
tags_s!(input, "NAME:")
}
Instead here you need to explicitly return an IResult<input_type, out_type>
.
tag_s!
will be our building block; it parses hard-coded strings. (The non _s function
tagtakes bytes). Ok, but what if I don't care about the
NAME` token, but the actual string that comes after?
named!(name<&str, &str>,
preceded!("NAME:")
);
Gets you whatever comes after NAME:
.
But wait, I think there’s optional whitespace after the colon! Or at least I want to support it.
This gets us to our next bread & butter function, do_parse!
(used to be called chain! I think), which lets you chain things, and also map your result to a type.
named!(name<&str,&str>,
do_parse!(
tag_s!("NAME:") >>
space0 >>
value: rest >>
(value.trim_end())
)
);
So with do_parse
you can chain parsers to parse, say, a whole line, and you can give names to some of the parsers’ outputs, and return it.
Here I use the space0
parser which is 0 or more whitespace characters, and the rest
parser, which gives you the rest of an input. (This works on a single line)
After chaining these parsers together, we return the value
, but apply trim_end()
to it in case there are any trailing whitespaces/tabs.
Nice! But a lot of these headers are defined as KEY: VALUE
pairs, so I’d like to make one that can be given a KEY and return me the VALUE.
Enter named_args
:
named_args!(kv<'a>(key: &'a str)<&'a str, &'a str>,
do_parse!(
tag_s!(key) >>
space0 >>
tag_s!(":") >>
space0 >>
value: rest >>
(value)
)
);
Sadly I had to put lifetimes in, and I don’t understand what these mean (because of the macro interaction).
This makes a function called kv
which takes both input :str
, and a key:str. Unfortunately you have to start giving lifetimes here, but it’s not too bad.
Now my test looks like
#[test]
fn name_test() {
assert_eq!(kv("NAME: some_name45", "NAME"), Ok(("", "some_name45")))
}
Other parsers can call the kv
function with the call!
macro.
Let’s use it again!
###The TYPE line.
I like this example because the difficulty (after an abrupt jump into the Nom pool) is ramping up real slow.
Next is the TYPE
property, which is an ENUM of
1.1.2 TYPE : <string>
Specifies the type of the data. Possible types are
TSP Data for a symmetric traveling salesman problem
ATSP Data for an asymmetric traveling salesman problem
SOP Data for a sequential ordering problem
HCP Hamiltonian cycle problem data
CVRP Capacitated vehicle routing problem data
TOUR A collection of tours
So this is a straight up Rust enum:
#[derive(Debug, PartialEq, Eq, Clone, Display, EnumString, EnumIter)]
enum ProblemType {
TSP,
ATSP,
SOP,
HCP,
CVRP,
TOUR,
}
I wanted to have some automatically derived FromString
function. Turns out this is called the FromStr
trait, and you usually write it
yourself, but I found a crate called strum
which gives us a derive macro, EnumString
, we can use. So now we could do
let problem_type: ProblemType = "TSP".parse().unwrap();
Let’s go implement our TYPE parser! What I like about parser combinators is that once you get them right, they’re really elegant! This is what the TYPE parser looks like:
named!(get_type<&str, ProblemType>,
map_res!(call!(kv, "TYPE"), str::parse)
);
Oh yes, this introduces another combinator, map_res!
. It takes the output from a combinator and applies a function to it, I believe exactly like
the regular Result.map. And that’s all we have to do!
I made this quick and dirty unit test with a vim macro:
#[test]
fn test_type() {
assert_eq!(get_type("TYPE: TSP"), Ok(("", ProblemType::TSP)));
assert_eq!(get_type("TYPE: ATSP"), Ok(("", ProblemType::ATSP)));
assert_eq!(get_type("TYPE: SOP"), Ok(("", ProblemType::SOP)));
assert_eq!(get_type("TYPE: HCP"), Ok(("", ProblemType::HCP)));
assert_eq!(get_type("TYPE: CVRP"), Ok(("", ProblemType::CVRP)));
assert_eq!(get_type("TYPE: TOUR"), Ok(("", ProblemType::TOUR)));
}
If this enum was to change a lot, our test should problably look different. It would probably enumerate every value in the Enum, convert it
to a string, then make sure the get_type
parser could parse it. However, since the parser is so small, this is basically unit testing
str::parse
and our macro derived from_str function.
###The DIMENSION property
1.1.4 DIMENSION : <integer>
For a TSP or ATSP, the dimension is the number of its nodes. For a CVRP, it is the total
number of nodes and depots. For a TOUR file it is the dimension of the corresponding
problem.
So here we’re parsing an integer. We can almost exactly copy the get_type
parser, except we tell str::parse
that we want an integer
back:
named!(get_dimension<&str, i64>,
map_res!(call!(kv, "DIMENSION"), str::parse::<i64>)
);
#[test]
fn test_dimension() {
let dimension = 8;
assert_eq!(
get_dimension(&format!("DIMENSION: {}", dimension)),
Ok(("", dimension))
);
}
And actually, since we specified the return type of our function as i64, Rust can infer what return type we want from parse, and we don’t even need to tell it!!!!:
named!(get_dimension<&str, i64>,
map_res!(call!(kv, "DIMENSION"), str::parse)
);
It’s almost like we could make our kv
function call parse for us, and have the generics machinery make us not have to call map_res everytime.
At first, as a Rust noobie I’m stopped here with the type magic. Also I wasn’t sure if the Nom macros would work with a trait bound parameter.
But then I ended up doing a bunch of trait bound type magic. I got tired of writing out the tests the exact same way, so I told myself; surely I could write a function that calls my parser with a value, and then checks that it is correctly parsed? I ended up with this:
fn test_kv<G: Display + Debug + PartialEq>(
kvfunc: fn(&str) -> Result<(&str, G), Err<&str>>,
key: &str,
value: G,
) {
assert_eq!(kvfunc(&format!("{}: {}", key, value)), Ok(("", value)));
}
So this takes a parser that works with KV’s, then a key, and a value to try parsing with. The Trait bounds are for
- Display -> that the type can be put into a string
- PartialEq -> Makes the values comparable
- Debug -> assert_eq! was complaining I just put them in as the compiler complained.
Now I can write my tests as
#[test]
fn test_capacity_kv() {
test_kv(get_capacity, "CAPACITY", 8)
}
Which is nicer, and leaves open the door for test_kv to maybe even generate some of the
valid values itself, in a kind of Property Based Testing scenario (Maybe I would just add the + Generate
Trait bound)? Nice.
NB: It didn’t work with the name_test, because somehow the lifetimes are mismatched. A combination of my experience level, macros making this opaque, lifetimes being hard to understand, and functions being generally hard to pass around in Rust made lose a few hours trying to fix it, but I’ll probably leave it.
I think it has to do with the kv
function making the key
variable have the same lifetime as its results. I wasn’t able to set
the key lifetime to a different lifetime
###Doing the type magic anyway
fn kv_parse<'a, T>(input: &'a str, key: &'a str) -> IResult<&'a str, T>
where
T: std::str::FromStr,
{
kv(input, key).and_then(|(i, v)| match str::parse::<T>(v) {
Ok(tv) => Ok((i, tv)),
Err(a) => Err(Err::Error((i, nom::error::ErrorKind::ParseTo))),
})
}
This made my simple key:value
parsers so clean:
fn get_dimension(input: &str) -> IResult<&str, i64> {
kv_parse(input, "DIMENSION")
}
fn get_edge_weight_type(input: &str) -> IResult<&str, EdgeWeightType> {
kv_parse(input, "EDGE_WEIGHT_TYPE")
}
Then I wasted a BUNCH of time trying to go further and storing a dictionary/map of KEY to type mappings (maybe in an enum?) so this could be autogenerated… It was a bunch of rabbit holing for no output.
###A first attempt at parsing the metadata into a coherent object. At this point I have all the pieces to put the metadata into a TSPLIB Problem object. Here’s what I’ve got:
#[derive(Debug, PartialEq, Eq, Clone)]
struct TSPLProblem {
dimension: i64,
coords: Vec<Coord>,
name: String,
comment: String,
problem_type: ProblemType,
edge_weight_type: EdgeWeightType,
}
And to parse it, I did a big do_parse
with some opt
’s around each line to make them all optional.
named!(parse_problem<&str, TSPLProblem>,
do_parse!(
name: opt!(get_name) >>
ptype: opt!(get_type) >>
comment: opt!(get_comment) >>
dimension: opt!(get_dimension) >>
ewt: opt!(get_edge_weight_type) >>
(TSPLProblem {
name:name.unwrap_or("").to_string(),
problem_type:ptype.unwrap(),
comment: comment.unwrap_or("").to_string(),
dimension:dimension.unwrap(),
edge_weight_type:ewt.unwrap_or(EdgeWeightType::EUC_2D),
coords: Vec::new(),
})
)
);
The constructor doesn’t take Options
however, so I either set a default value, or panic’ed with unwrap
. Will revisit later.
I’m not sure which ones are actually required at the moment.
Then I put in an empty list of coordinates
because they are in a different section and haven’t parsed them yet.
###Fixing the KV parser
As this parser stood, it wouldn’t work on multiline files. I wasn’t sure how to do it when starting so I had punted on it.
I took me some fiddling (one dead-end with the take_while function), but I found how to do it in the nom tests. I modified kv
to look like:
named_args!(kv<'a>(key: &'a str)<&'a str, &'a str>,
do_parse!(
tag_s!(key) >>
tag_s!(":") >>
space0 >>
value: not_line_ending >>
line_ending >>
(value)
)
);
So now my first version of the test_parse_problem
looked like this:
#[test]
fn test_parse_problem() {
let header = "NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION: 52
EDGE_WEIGHT_TYPE: EUC_2D
";
let parsed = TSPLProblem {
name: "berlin52".to_string(),
problem_type: ProblemType::TSP,
comment: "52 locations in Berlin (Groetschel)".to_string(),
dimension: 52,
edge_weight_type: EdgeWeightType::EUC_2D,
coords: Vec::new(),
};
assert_eq!(parse_problem(header), Ok(("", parsed)))
}
And to my surprise, after the small multi-line modification, it passed immediately! Yay parser combinators and incrementally building up confidence in bigger and bigger pieces.
Now, what’s left is the (soy)“meat” of the format, to get the city list, which I had put off because it’s not in the super easy KEY: VALUE
format.
###Getting the cities There are a few ways of specifying TSPLIB problems. In the small samples I had seen, there would be a line called
NODE_COORD_SECTION
1 38.24 20.42
2 39.57 26.15
...
I was a little apprehensive when first starting, but now am ready to go! We’ll need:
- a tag_s(“NODE_COORD_SECTION”),
- a newline
- a Coord2D parser that gets the i, x, y values into a Coord2D
- then a
many1!
combinator that gives us a list of Coords
The coord2d parser looked like this:
named!(get_2d_coord<&str, Coord>,
do_parse!(
opt!(multispace) >>
i: digit >>
space >>
x: float >>
space >>
y: float >>
line_ending >>
(Coord( i.parse().unwrap(), n32(x), n32(y))))
);
#[test]
fn test_2d_coords() {
let input = " 1 1.0 3.0\n";
assert_eq!(get_2d_coord(input), Ok(("", Coord(1, n32(1.0), n32(3.0)))));
}
And the full node_coords_section like this:
named!(node_data_section<&str, Vec<Coord> >,
do_parse!(
tag_s!("NODE_COORD_SECTION") >>
line_ending >>
coords: many1!(get_2d_coord) >>
opt!(complete!(tag_s!("EOF\n"))) >>
(coords)
)
);
Eventually I settled on using the sep!() macro to talk about “lists of lines that look like X separated by a newline” and not putting the newline in the parsers for X.
###Making the order not matter
The current parser needs each metadata line to be in the right place, but the spec says they should be in any order.
For this, nom has a better way, using the permute!
macro, which can take child parsers in any order, and make them optional by adding ?
instead of
wrapping in opt.
So after a lot of messing around, I came up with :
fn build_problem(
(name, problem_type, comment, dimension, ewt, capacity, ewf, edf, ddt, nct): (
Option<&str>,
Option<ProblemType>,
Option<&str>,
Option<i64>,
Option<EdgeWeightType>,
Option<i64>,
Option<EdgeWeightFormat>,
Option<EdgeDataFormat>,
Option<DisplayDataType>,
Option<NodeCoordType>,
),
) -> TSPLProblem {
TSPLProblem {
name: name.unwrap_or("").to_string(),
problem_type: problem_type.unwrap(),
comment: comment.unwrap_or("").to_string(),
dimension: dimension.unwrap(),
capacity: capacity,
edge_weight_type: ewt.unwrap_or(EdgeWeightType::EUC_2D),
coords: Vec::new(),
edge_data_format: edf,
edge_weight_format: ewf,
node_coord_type: nct.unwrap_or(NodeCoordType::NO_COORDS),
display_data_type: ddt.unwrap_or(DisplayDataType::NO_DISPLAY),
}
}
fn parse_problem_perm(input: &str) -> IResult<&str, TSPLProblem> {
map!(
input,
permutation!(
get_name?,
get_type?,
get_comment?,
get_dimension?,
get_edge_weight_type?,
get_capacity?,
get_edge_weight_format?,
get_edge_data_format?,
get_display_data_type?,
opt!(get_node_coord_type)
),
build_problem
)
}
Which has a fair bit of duplication in it, but it’s not very bad, and it’s not too much of a problem. This section of the data format would rarely change
much. Additionally, this lets us do validation in the build_problem
function, where we could check if say the the dimension was set correctly.
There is currently a small bug in nom, the last parser in the list can’t have a ?
appended to it, so I used the opt!
macro instead.
This actually only reads the Metadata section, not the data section. In this case, we have a fairly simple setup, where can read first the metadata, then pass that along to a data parser who can determine what to look for. There will be a couple of dependencies between the metadata and the data section. One mentioned above for example, is the node_coord_type, which is either 2d or 3d, making the node data either a pair or a triple of numbers. Nom contains some logic for this (choosing a parser based on the output of a different parser), but here we have an even simpler solution. We can parse the metadata section first, and then configure exactly what we’ll be expecting to the data section parser. For example, a problem of type TSP shouldn’t have a DEPOT section.
##Filling out the format
Personally, I was only interested in the TSP/ATSP variants but I decided I could finish implementing the rest of the spec and maybe push a crate to help future Rust TSP writers . Lol I’m aware this is a niche within a niche but why not, it might be fun to push a crate.
The other kinds of problems that TSPLIB supports are Multi-vehicle routing, with depots. Imagine the Post office scheduling deliveries, or a taxi dispatch system.
Parser wise, the data sections for these problems are lists of different kinds of numbers.
comments powered by Disqus