This is a cleaned-up overview of what I did to implement a text-based parser with Nom, which should serve as maybe 35% tutorial.

Nom is a parser-combinator library for Rust. Other parser libraries in Rust include pest, which is PEG based, and LALRPOP, which generates LR(1) parsers.

Parser Combinators

Parsing is an unsolved problem: there's no one right answer the way ISO 8601 is for dates or serializable isolation is for DBs. Here is a short overview of the challenges.

Essentially, there are competing ways of describing your language, and many theoretical trade-offs to make. No version is perfect.

Parser combinators are one approach, whose pros are that they're a) easy to read and b) easy to build up from the bottom up,

and the main con people bring up is that they are equivalent in power to LL(1) parsers, and therefore have the same ambiguity/readability issues.

The one thing that I like about nom/combinators is how testable they are! Combined with Rust's inline testing, it makes it really easy to build things up.

About Nom

Nom has changed a lot over time, and you may find outdated documentation describing old macros or older ways of doing things.

Mainly, nom implements the combinator-ing through macros that can take any number of arguments. This sometimes makes them hard to debug when something goes wrong.

The other thing to know about nom is its error type. The parsers that you write and combine take your input (&str or &[u8]) and all return an IResult type, which is equivalent to a Result<(Input, Output), Error> type, where Error is nom's error type (and you can choose a custom one later if you want). The Input in the Result carries the as-yet-unparsed input as you apply a parser. So applying a parser for “Hello” to “Hello World” would return Ok((" World", "Hello")). We'll see more of this later.
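To make that concrete, here's a tiny preview in the macro style we'll use throughout this post (a minimal sketch, nom 4-era macros):

#[macro_use]
extern crate nom;

named!(hello<&str, &str>, tag_s!("Hello"));

#[test]
fn hello_test() {
    // Remaining input first, parsed output second.
    assert_eq!(hello("Hello World"), Ok((" World", "Hello")));
}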

The format

So, what we’ll be parsing are TSPLIB problem definitions, which mostly look like this:

NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION: 52
EDGE_WEIGHT_TYPE: EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
... (coords omitted for brevity)
EOF

The spec is here.

The reason for the spec is that there are a lot of other types of TSP-like problems (e.g. asymmetric TSP) and different ways to state the problem. Examples are solving an “all cities are connected by their Euclidean distance” problem vs. a “take these particular edges with these weights” problem, and an option to describe adjacency lists vs. a full adjacency matrix.

Here is a simple parser:

named!(name<&str, &str>,
    tag_s!("NAME:")
);

This will make a function called name, which will parse the atom “NAME:”, for our problem's name. The type signature is <input_type (here &str), your function's output type>. The input type can be &str or &[u8]. Nom has a slight lean towards parsing streams of binary data, but is eminently usable for our case of non-streaming text data.

tag_s! will be our building block; it parses hard-coded strings. (The non-_s macro tag! takes bytes.)

Ok, but what if I don’t care about the NAME token, but the actual string that comes after?

named!(name<&str, &str>,
    preceded!(tag_s!("NAME:"), rest)
);

This gets you whatever comes after NAME: (rest is a parser that returns whatever input remains). But wait, I think there's optional whitespace after the colon! Or at least I want to support it. This gets us to our next bread-and-butter macro, do_parse! (which I believe used to be called chain!), which lets you chain things, and also map your result to a type.

named!(name<&str, &str>,
    do_parse!(
        tag_s!("NAME:") >>
        space0 >>
        value: rest >>
        (value.trim_end())
    )
);

So with do_parse! you can chain parsers to parse, say, a whole line; you can give names to some of the parsers' outputs, and return them. Here I use the space0 parser, which matches 0 or more whitespace characters, and the rest parser, which gives you the rest of the input (this works because we feed it a single line at a time). After chaining these parsers together, we return the value, but apply trim_end() to it in case there is any trailing whitespace or tabs.
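As a quick sanity check, an inline test for this parser could look like this (a sketch; berlin52 is just a sample value):

#[test]
fn name_value_test() {
    // space0 eats the space after the colon; trim_end drops the trailing spaces
    assert_eq!(name("NAME: berlin52  "), Ok(("", "berlin52")));
}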

Nice! But a lot of these headers are defined as KEY: VALUE pairs, so I’d like to make one that can be given a KEY and return me the VALUE.

Enter named_args:

named_args!(kv<'a>(key: &'a str)<&'a str, &'a str>,
    do_parse!(
        tag_s!(key) >>
        tag_s!(":") >>
        space0 >>
        value: rest >>
        (value)
    )
);

Sadly I had to put lifetimes in, and I don't fully understand what they mean here (the macro interaction obscures it).

This makes a function called kv which takes both an input: &str and a key: &str. Unfortunately you have to start giving lifetimes at this point, but it's not too bad.

Now my test looks like this:

#[test]
fn name_test() {
    assert_eq!(kv("NAME: some_name45", "NAME"), Ok(("", "some_name45")))
}

Other parsers can call the kv function with the call! macro. Let’s use it again!

The TYPE line

I like this example because the difficulty (after an abrupt jump into the Nom pool) ramps up real slowly. Next is the TYPE property, which is an enum of:

1.1.2 TYPE : <string>
Specifies the type of the data. Possible types are
TSP Data for a symmetric traveling salesman problem
ATSP Data for an asymmetric traveling salesman problem
SOP Data for a sequential ordering problem
HCP Hamiltonian cycle problem data
CVRP Capacitated vehicle routing problem data
TOUR A collection of tours

So this is a straight-up Rust enum:

use strum_macros::EnumString; // gives us a derived FromStr (see below)

#[derive(Debug, PartialEq, Eq, EnumString)]
enum ProblemType {
    TSP,
    ATSP,
    SOP,
    HCP,
    CVRP,
    TOUR,
}

I wanted to have an automatically derived from-string function. Turns out this is the FromStr trait, and you usually write it yourself, but I found a crate called strum which gives us the EnumString derive macro used above. So now we can do:

let problem_type: ProblemType = "TSP".parse().unwrap();
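The flip side is that parsing can fail on an unrecognized string, which is exactly what map_res! will hook into below (a quick sketch):

let bad: Result<ProblemType, _> = "NOT_A_TYPE".parse();
assert!(bad.is_err());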

Let’s go implement our TYPE parser!

What I like about parser combinators is that once you get them right, they’re really elegant! This is what the TYPE parser looks like:

named!(get_type<&str, ProblemType>,
    map_res!(call!(kv, "TYPE"), str::parse)
);

Oh yes, this introduces another combinator, map_res!. It takes the output from a parser and applies a function that returns a Result, failing the parse if that function returns an Err (so it's more like Result::and_then than Result::map). And that's all we have to do! I made this quick and dirty unit test with a vim macro:

#[test]
fn test_type() {
    assert_eq!(get_type("TYPE: TSP"), Ok(("", ProblemType::TSP)));
    assert_eq!(get_type("TYPE: ATSP"), Ok(("", ProblemType::ATSP)));
    assert_eq!(get_type("TYPE: SOP"), Ok(("", ProblemType::SOP)));
    assert_eq!(get_type("TYPE: HCP"), Ok(("", ProblemType::HCP)));
    assert_eq!(get_type("TYPE: CVRP"), Ok(("", ProblemType::CVRP)));
    assert_eq!(get_type("TYPE: TOUR"), Ok(("", ProblemType::TOUR)));
}

If this enum were to change a lot, our test should probably look different. It would probably enumerate every value in the enum, convert it to a string, then make sure the get_type parser could parse it. However, since the parser is so small, this is basically unit testing str::parse and our derived from_str implementation.
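If I did want that style of test, strum also has an EnumIter derive that would make it easy (a sketch, assuming #[derive(EnumIter)] gets added to ProblemType):

#[test]
fn test_type_all_variants() {
    use strum::IntoEnumIterator;
    for variant in ProblemType::iter() {
        // Debug formatting prints the bare variant name, e.g. "TSP"
        let line = format!("TYPE: {:?}", variant);
        assert_eq!(get_type(&line), Ok(("", variant)));
    }
}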

The DIMENSION property

1.1.4 DIMENSION : <integer>
For a TSP or ATSP, the dimension is the number of its nodes. For a CVRP, it is the total
number of nodes and depots. For a TOUR file it is the dimension of the corresponding
problem.

So here we’re parsing an integer. We can almost exactly copy the get_type parser, except we tell str::parse that we want an integer back:

named!(get_dimension<&str, i64>,
    map_res!(call!(kv, "DIMENSION"), str::parse::<i64>)
);

#[test]
fn test_dimension() {
    let dimension = 8;
    assert_eq!(
        get_dimension(&format!("DIMENSION: {}", dimension)),
        Ok(("", dimension))
    );
}

And actually, since we specified the return type of our function as i64, Rust can infer what return type we want from parse, and we don't even need to tell it!

named!(get_dimension<&str, i64>,
    map_res!(call!(kv, "DIMENSION"), str::parse)
);

It's almost as if we could make our kv function call parse for us, and have the generics machinery save us from calling map_res! every time. As a Rust newbie I stopped here at first, put off by the type magic. Also, I wasn't sure if the Nom macros would work with a trait-bound parameter.

But then I ended up doing a bunch of trait-bound type magic anyway. I got tired of writing out the tests the exact same way, so I told myself: surely I can write a function that calls my parser with a value, and then checks that it is correctly parsed? I ended up with this:

use std::fmt::{Debug, Display};
use nom::Err;

fn test_kv<G: Display + Debug + PartialEq>(
    kvfunc: fn(&str) -> Result<(&str, G), Err<&str>>,
    key: &str,
    value: G,
) {
    assert_eq!(kvfunc(&format!("{}: {}", key, value)), Ok(("", value)));
}

So this takes a parser that works with KVs, then a key, and a value to try parsing with. The trait bounds are:

  • Display -> so the value can be formatted into the test input string
  • PartialEq -> makes the parsed and expected values comparable
  • Debug -> so assert_eq! can print the values when the assertion fails; I just added bounds as the compiler complained.

Now I can write my tests as

#[test]
fn test_capacity_kv() {
    test_kv(get_capacity, "CAPACITY", 8)
}

Which is nicer, and leaves the door open for test_kv to generate some of the valid values itself, in a kind of property-based-testing scenario (maybe I would just add a Generate-style trait bound). Nice.
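For what that could look like, here's a sketch using the quickcheck crate (hypothetical; not actually in the project):

use quickcheck_macros::quickcheck;

#[quickcheck]
fn prop_dimension_roundtrips(value: i64) -> bool {
    // quickcheck generates arbitrary i64s; every one should roundtrip
    get_dimension(&format!("DIMENSION: {}", value)) == Ok(("", value))
}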

NB: It didn't work with the name_test, because somehow the lifetimes are mismatched. A combination of my experience level, macros making this opaque, lifetimes being hard to understand, and functions being generally hard to pass around in Rust made me lose a few hours trying to fix it, but I'll probably leave it.

I think it has to do with the kv function making the key parameter have the same lifetime as its results; I wasn't able to give the key a different lifetime.

Doing the type magic anyway

use nom::{Err, IResult};

fn kv_parse<'a, T>(input: &'a str, key: &'a str) -> IResult<&'a str, T>
where
    T: std::str::FromStr,
{
    // Parse the value out as a string, then try to convert it into a T,
    // turning a failed conversion into a nom error.
    kv(input, key).and_then(|(i, v)| match str::parse::<T>(v) {
        Ok(tv) => Ok((i, tv)),
        Err(_) => Err(Err::Error((i, nom::error::ErrorKind::ParseTo))),
    })
}

This made my simple key:value parsers so clean:

fn get_dimension(input: &str) -> IResult<&str, i64> {
    kv_parse(input, "DIMENSION")
}

fn get_edge_weight_type(input: &str) -> IResult<&str, EdgeWeightType> {
    kv_parse(input, "EDGE_WEIGHT_TYPE")
}

Then I wasted a BUNCH of time trying to go further, storing a dictionary/map of KEY-to-type mappings (maybe in an enum?) so this could be autogenerated… It was a bunch of rabbit-holing for no output.

A first attempt at parsing the metadata into a coherent object

At this point I have all the pieces to put the metadata into a TSPLIB Problem object. Here’s what I’ve got:

#[derive(Debug, PartialEq, Eq, Clone)]
struct TSPLProblem {
    dimension: i64,
    coords: Vec<Coord>,
    name: String,
    comment: String,
    problem_type: ProblemType,
    edge_weight_type: EdgeWeightType,
}

And to parse it, I did a big do_parse! with opt!'s around each line to make them all optional.

named!(parse_problem<&str, TSPLProblem>,
    do_parse!(
        name: opt!(get_name) >>
        ptype: opt!(get_type) >>
        comment: opt!(get_comment) >>
        dimension: opt!(get_dimension) >>
        ewt: opt!(get_edge_weight_type) >>
        (TSPLProblem {
            name: name.unwrap_or("").to_string(),
            problem_type: ptype.unwrap(),
            comment: comment.unwrap_or("").to_string(),
            dimension: dimension.unwrap(),
            edge_weight_type: ewt.unwrap_or(EdgeWeightType::EUC_2D),
            coords: Vec::new(),
        })
    )
);

The constructor doesn't take Options, however, so I either set a default value or panicked with unwrap; I'll revisit this later, since I'm not sure which fields are actually required at the moment. I put in an empty list of coordinates because they live in a different section that I haven't parsed yet.

Now what's left is the (soy) “meat” of the format: the city list, which I had put off because it's not in the super easy KEY: VALUE format.

Getting the cities

There are a few ways of specifying TSPLIB problems. In the small samples I had seen, there would be a section like

NODE_COORD_SECTION
 1 38.24 20.42
 2 39.57 26.15
 ...

I was a little apprehensive when first starting, but now am ready to go! We'll need:

  • a tag_s!("NODE_COORD_SECTION")
  • a newline
  • a Coord2D parser that gets the i, x, y values into a Coord2D
  • a many1! combinator that gives us a list of Coords

The coord2d parser looked like this:

named!(get_2d_coord<&str, Coord>,
    do_parse!(
        opt!(multispace) >>
        i: digit >>
        space >>
        x: float >>
        space >>
        y: float >>
        line_ending >>
        (Coord(i.parse().unwrap(), n32(x), n32(y)))
    )
);

#[test]
fn test_2d_coords() {
    let input = " 1 1.0 3.0\n";
    assert_eq!(get_2d_coord(input), Ok(("", Coord(1, n32(1.0), n32(3.0)))));
}

And the full section parser, node_data_section, looks like this:

named!(node_data_section<&str, Vec<Coord> >,
    do_parse!(
        tag_s!("NODE_COORD_SECTION") >>
        line_ending >>
        coords: many1!(get_2d_coord) >>
        opt!(complete!(tag_s!("EOF\n"))) >>
        (coords)
    )
);

I still haven't figured out the EOF/end-of-input business, but it passes the current unit test.

A note on the floats: Rust zealously implements IEEE 754, under which NaN compares unequal to everything including itself, so f32 only implements PartialEq/PartialOrd, not Eq or Ord. So I use the noisy_float crate, which panics instead of holding NaNs, and has an N32 type (and other sizes) that holds a non-NaN f32. This might be a real shift, with our software better implementing specs, but this spec was written in the 90s and doesn't care too much.
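For reference, here's roughly how the noisy_float types behave (a small sketch):

use noisy_float::prelude::*; // provides n32 and the N32 type

fn main() {
    let a = n32(1.0);
    assert!(a < n32(3.0)); // N32 implements Ord and Eq, unlike raw f32
    assert_eq!(a, n32(1.0));
    // n32(0.0 / 0.0) would panic here instead of smuggling a NaN through
}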

I'm not sure where to put this, but eventually I settled on using the separated_list!() macro to express “lists of lines that look like X, separated by a newline”, rather than putting the newline inside the parsers for X.
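In code, that shape is roughly this (a sketch; it assumes a variant of get_2d_coord that doesn't consume its own trailing newline, unlike the one above):

named!(coord_lines<&str, Vec<Coord> >,
    separated_list!(line_ending, get_2d_coord)
);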

Making the order not matter

The current parser needs each metadata line to be in the right place, but the spec says they can come in any order. For this, nom has a better way: the permutation! macro, which takes child parsers in any order, and can make each optional by appending ? instead of wrapping it in opt!. So after a lot of messing around, I came up with:

fn build_problem(
    (name, problem_type, comment, dimension, ewt, capacity, ewf, edf, ddt, nct): (
        Option<&str>,
        Option<ProblemType>,
        Option<&str>,
        Option<i64>,
        Option<EdgeWeightType>,
        Option<i64>,
        Option<EdgeWeightFormat>,
        Option<EdgeDataFormat>,
        Option<DisplayDataType>,
        Option<NodeCoordType>,
    ),
) -> TSPLProblem {
    TSPLProblem {
        name: name.unwrap_or("").to_string(),
        problem_type: problem_type.unwrap(),
        comment: comment.unwrap_or("").to_string(),
        dimension: dimension.unwrap(),
        capacity: capacity,
        edge_weight_type: ewt.unwrap_or(EdgeWeightType::EUC_2D),
        coords: Vec::new(),
        edge_data_format: edf,
        edge_weight_format: ewf,
        node_coord_type: nct.unwrap_or(NodeCoordType::NO_COORDS),
        display_data_type: ddt.unwrap_or(DisplayDataType::NO_DISPLAY),
    }
}
fn parse_problem_perm(input: &str) -> IResult<&str, TSPLProblem> {
    map!(
        input,
        permutation!(
            get_name?,
            get_type?,
            get_comment?,
            get_dimension?,
            get_edge_weight_type?,
            get_capacity?,
            get_edge_weight_format?,
            get_edge_data_format?,
            get_display_data_type?,
            opt!(get_node_coord_type)
        ),
        build_problem
    )
}

Which has a fair bit of duplication in it, but it's not too bad; this section of the data format would rarely change. Additionally, this lets us do validation in the build_problem function, where we could check, say, that the dimension was set correctly.
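For example, a check along these lines could live there (hypothetical; not in the post's code):

// Sketch: inside build_problem, the spec ties CAPACITY to CVRP problems
if problem_type == Some(ProblemType::CVRP) {
    assert!(capacity.is_some(), "CVRP problems need a CAPACITY line");
}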

There is currently a small bug in nom: the last parser in the list can't have a ? appended to it, so I used the opt! macro for that one instead.

This actually only reads the metadata section, not the data section. We have a fairly simple setup here: we can read the metadata first, then pass it along to a data parser that determines what to look for. There are a couple of dependencies between the metadata and the data section. One mentioned above is node_coord_type, which is either 2D or 3D, making each node line either a pair or a triple of numbers; likewise, a problem of type TSP shouldn't have a DEPOT section. Nom contains some logic for choosing a parser based on the output of a different parser, but parsing the metadata first and then configuring exactly what the data-section parser expects is even simpler.
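As a sketch of that idea (a hypothetical helper, assuming TSPLMeta carries the parsed problem type):

fn expects_depot_section(meta: &TSPLMeta) -> bool {
    // Per the spec, depots only appear in vehicle routing problems
    meta.problem_type == ProblemType::CVRP
}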

Filling out the format

Personally, I was only interested in the TSP/ATSP variants, but I decided I could finish implementing the rest of the spec and maybe publish a crate to help future Rust TSP writers. Lol, I'm aware this is a niche within a niche, but why not, it might be fun.

Parser-wise, the data sections for these problems are lists of different kinds of numbers: a SECTION_TITLE on one line, then lines of numbers.

For this I made a get_section parser, which is given a title and then applies a numbers_on_line parser repeatedly. get_section also takes a function that turns these Vecs of numbers into the correct data type, e.g. EdgeWeightData.

fn numbers_on_line(input: &str) -> IResult<&str, Vec<f32>> {
    separated_list!(input, space1, float)
}

#[test]
fn test_numbers_on_line() {
    assert_eq!(
        numbers_on_line("1.2 2 3 4"),
        Ok(("", vec![1.2, 2.0, 3.0, 4.0]))
    );
}

fn get_section<'a, T>(
    input: &'a str,
    section_title: &'a str,
    line_parser: fn(Vec<f32>) -> Option<T>,
) -> IResult<&'a str, Vec<T>>
{
    do_parse!(
        input,
        tag!(section_title)
            >> line_ending
            >> space0
            >> payload: separated_list!(multispace1, map_opt!(numbers_on_line, line_parser))
            >> space0
            >> opt!(line_ending)
            >> opt!(complete!(tag!("EOF\n")))
            >> (payload)
    )
}

A caveat here is that numbers_on_line uses the float parser for every field, so every number is parsed as an f32, and we have to truncate integer fields back to ints.

Here is how we’d use get_section to parse NODE_COORD_SECTION:

fn parse_coord_vec(input: Vec<f32>) -> Option<Coord> {
    // A 2D coord line is exactly three numbers: node id, x, y
    if input.len() != 3 {
        None
    } else {
        Some(Coord(input[0] as i64, n32(input[1]), n32(input[2])))
    }
}

Then we can call!(get_section, "NODE_COORD_SECTION", parse_coord_vec) to get a Vec<Coord>.
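The same pattern covers the integer-valued sections, modulo the f32-truncation caveat from above. For instance, a demand-line converter might look like this (a sketch; the real parse_demand_vec may differ):

fn parse_demand_vec(input: Vec<f32>) -> Option<(usize, u32)> {
    // A DEMAND_SECTION line is two integers: node id and demand
    if input.len() != 2 {
        None
    } else {
        Some((input[0] as usize, input[1] as u32))
    }
}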

Here is my full parse_data_section, which is given a problem header and a string, and returns a FullProblem (type names all subject to change towards being less bad!):

fn parse_data_section<'a>(input: &'a str, header: TSPLMeta) -> IResult<&'a str, FullProblem> {
    //Here we should be building a list of sections that we are expecting based
    //on the header data. At the moment we are making every section optional.
    //So required sections are able to be omitted and unrelated/nonsensical sections are
    //allowed, which is bad.
    let edge_parser = match header.edge_data_format {
        Some(EdgeDataFormat::ADJ_LIST) => parse_adjacency_vec,
        Some(EdgeDataFormat::EDGE_LIST) => parse_edgedata_vec,
        None => parse_edgedata_vec, //TODO: omit the EDGE_DATA_SECTION if there is no Format for it
    };
    map!(
        input,
        permutation!(
            complete!(call!(get_section, "NODE_COORD_SECTION", parse_coord_vec))?,
            complete!(call!(get_section, "DEPOT_SECTION", parse_depot_vec))?,
            complete!(call!(get_section, "DEMAND_SECTION", parse_demand_vec))?,
            complete!(call!(get_section, "EDGE_DATA_SECTION", edge_parser))?,
            complete!(call!(get_section, "FIXED_EDGES_SECTION", parse_edge_vec))?,
            complete!(call!(get_section, "DISPLAY_DATA_SECTION", parse_coord_vec))?, //TODO make this either 2d or 3d based on DISPLAY_DATA_TYPE
            complete!(call!(get_section, "TOUR_SECTION", parse_tour_vec))?,
            complete!(call!(get_section, "EDGE_WEIGHT_SECTION", parse_weights_vec))?
        ),
        |parser_output| build_full_problem(&header, parser_output)
    )
}

I had to wrap all of my parsers with complete!, and at first I wasn't sure why: they were already all opt!-ional, but were returning Incomplete. (complete! turns an Incomplete result into an error, so opt! and permutation! can treat “not enough input” as a plain failure instead of waiting for more data.)

And that's it! It was frustrating at times, and some of the lifetimes (passing around &str) and errors (and lifetimes of errors!) were confusing, but that's mostly beginner pains. This was my first Rust project.

The semi-finished crate is online here.
