"Parse, don't validate" through the years with C++

86 points
6 days ago
by dwrodri

Comments


foobar1726

It seems like the C++98 example is the best by far? Keeps all error information while remaining concise and easy to understand. Not to mention 50 times faster. (Could be improved by adding some simple type aliases like BirthYear that explicitly start from 1900.)

IMO the main takeaway is that malformed input is not an exceptional state when parsing, and should be treated as a first class citizen. Everything else is yak shaving how you want to handle the (status, validObject) tuple coming from the parser.

3 days ago

philip-b

The compile time is 50 times faster, not the runtime.

3 days ago

_alphageek

The C++11 example is the weakest in the article by its own thesis. Public throwing constructor, no year check, no leap-year check, so Birthdate(0, 2, 30) constructs cleanly. The C++17/23 shape (private ctor + static factory) is the actual mechanical insight from King's essay. Make the constructor a function that can fail, so the type itself carries the proof.
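A minimal sketch of the private-constructor-plus-static-factory shape described above, assuming C++17 and `std::optional`. The `Birthdate` layout, the `make` name, and the 1900-1999 year range are illustrative assumptions taken from this thread, not the article's actual code.

```cpp
#include <optional>

// Hypothetical Birthdate: the constructor is private, so the only way to get
// one is through make(), which can fail. Holding a Birthdate is therefore
// evidence that the checks passed.
class Birthdate {
public:
    // Returns std::nullopt instead of constructing an invalid value.
    static constexpr std::optional<Birthdate> make(int year, int month, int day) {
        if (year < 1900 || year > 1999) return std::nullopt;  // assumed toy range
        if (month < 1 || month > 12) return std::nullopt;
        if (day < 1 || day > days_in_month(year, month)) return std::nullopt;
        return Birthdate(year, month, day);
    }

    constexpr int year() const { return year_; }
    constexpr int month() const { return month_; }
    constexpr int day() const { return day_; }

private:
    constexpr Birthdate(int y, int m, int d) : year_(y), month_(m), day_(d) {}

    static constexpr bool is_leap(int y) {
        return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    }

    static constexpr int days_in_month(int y, int m) {
        constexpr int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        return (m == 2 && is_leap(y)) ? 29 : days[m - 1];
    }

    int year_;
    int month_;
    int day_;
};
```

With this shape, `Birthdate::make(0, 2, 30)` yields an empty optional rather than an object, and code that accepts a `Birthdate` parameter has nothing left to re-check.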

3 days ago

simonask

Just to note, a throwing constructor is “just as good” as a static factory method, provided you want to use exceptions for validation errors. Which you shouldn’t, but from the perspective of testing types as proof, it’s just as good.

3 days ago

noitpmeder

exactly, use std::expected as the return type, avoid exceptions, and make a failable factory constructor to build your type. Make invalid states unrepresentable!!!
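A rough sketch of a failable factory that also reports why it failed. To stay compilable as C++17 it uses `std::variant<Birthdate, DateError>` as a stand-in; with a C++23 toolchain the return type could be `std::expected<Birthdate, DateError>` instead. `DateError`, `make`, and the 1900-1999 range are hypothetical names and values chosen for illustration.

```cpp
#include <variant>

// Hypothetical error type: says which field was rejected.
enum class DateError { BadYear, BadMonth, BadDay };

class Birthdate {
public:
    // Stand-in for std::expected<Birthdate, DateError>: the result holds
    // either a valid Birthdate or the reason none could be built.
    static constexpr std::variant<Birthdate, DateError> make(int year, int month, int day) {
        if (year < 1900 || year > 1999) return DateError::BadYear;  // assumed toy range
        if (month < 1 || month > 12) return DateError::BadMonth;
        if (day < 1 || day > days_in_month(year, month)) return DateError::BadDay;
        return Birthdate(year, month, day);
    }

    constexpr int year() const { return year_; }
    constexpr int month() const { return month_; }
    constexpr int day() const { return day_; }

private:
    constexpr Birthdate(int y, int m, int d) : year_(y), month_(m), day_(d) {}

    static constexpr int days_in_month(int y, int m) {
        constexpr int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        const bool leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
        return (m == 2 && leap) ? 29 : days[m - 1];
    }

    int year_;
    int month_;
    int day_;
};
```

A caller would branch on `std::holds_alternative<Birthdate>(result)` (or `result.has_value()` with `std::expected`) before touching the value, so the error path cannot be skipped silently.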

3 days ago

dietr1ch

Aren't you time-travelling? std::expected is C++23 (so available starting from 2025-2027 xd)

https://en.cppreference.com/cpp/utility/expected

3 days ago

diath

It has been available since GCC 12.1 (May 2022), Clang 19.1 (Sep 2024), and Visual Studio 17.13 (2022~): https://godbolt.org/z/on1v6qdf3

These days compiler developers implement accepted standard features pretty fast.

3 days ago

noitpmeder

And tl::expected (a largely identical impl) has been available similarly as long!

3 days ago

gsliepen

The C example could have implemented a lot of validation just by checking the return value of sscanf():

    if (sscanf(user_input, "%4u-%2u-%2u", &year, &month, &day) != 3) {
        // return an error
    }

This still does not catch trailing garbage, but you could check for that as well:

    if (sscanf(user_input, "%4u-%2u-%2u%c", &year, &month, &day, &dummy) != 3) {
        // return an error
    }

The result would be 4 if there was at least one trailing character. Too bad there is still no std::scan() companion to C++23's std::print().
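The return-value check above could be wrapped up roughly like this. It is only a sketch: `split_iso_date` is a hypothetical helper name, and it checks the shape of the input only, not whether the month and day are in range.

```cpp
#include <cstdio>

// Hypothetical helper: splits "YYYY-MM-DD" into three fields and rejects
// input with anything after the day. Range checks would still have to follow.
bool split_iso_date(const char* input, unsigned& year, unsigned& month, unsigned& day) {
    char trailing = '\0';
    // 3 conversions: the three fields matched and nothing followed them.
    // 4 conversions: at least one extra character (a newline counts) was present.
    // Fewer than 3: the input did not match the pattern.
    const int matched = std::sscanf(input, "%4u-%2u-%2u%c", &year, &month, &day, &trailing);
    return matched == 3;
}
```

Note that `%4u` and `%2u` are maximum widths, so shorter fields such as `87-6-5` are still accepted, and `%u` skips leading whitespace. A stricter parser would need to check field lengths separately.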

3 days ago

tialaramex

Although it feels intuitively as though a std::scan could make sense, it doesn't, at least not with the sort of API I've seen suggested.

Consider a hypothetical Goose type, we can express any Goose usefully as output and, conveniently, some potential inputs could be read as a Goose successfully though most arbitrary strings cannot be understood as a Goose.

Providing std::print for Goose is simple: we've got a variable (or maybe a constant) of type Goose, and we just emit the correct sequence of symbols. Writing all the boilerplate in C++23 is annoying, but it's mechanical: not actually tricky to do, just very boring (so maybe C++26 makes that easier via reflection).

But how could std::scan for Goose work? We need a Goose variable to potentially store the Goose if we read one, but how can we make a default Goose? No, each Goose is unique and there is no substitute, this can't work.

The std::scan idea seems attractive for simple almost untyped input, strings, integers, that sort of thing, but the whole point of "Parse, don't validate" is that you probably want to parse email addresses and ISBNs and ISO dates, you don't want a string, another string and a third string.

Rust's FromStr trait is more appropriate. Given a type implements FromStr we can parse any string to (maybe) get an instance of that type, but we don't need an "empty" instance first because we're doing the construction when we call the function.
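For comparison, a FromStr-like shape can be approximated in C++17 with a static member that returns `std::optional<T>`. This is only a sketch: the `Goose` class, its `"goose:<name>"` text form, and the `from_text` / `parse` names are all made up for illustration.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical Goose: there is no default constructor, so no "empty" Goose
// exists that a scanf-style API could fill in.
class Goose {
public:
    // Assumed text form for illustration: "goose:<name>" with a non-empty name.
    static std::optional<Goose> from_text(std::string_view text) {
        constexpr std::string_view prefix = "goose:";
        if (text.size() <= prefix.size() || text.substr(0, prefix.size()) != prefix) {
            return std::nullopt;
        }
        return Goose(std::string(text.substr(prefix.size())));
    }

    const std::string& name() const { return name_; }

private:
    explicit Goose(std::string name) : name_(std::move(name)) {}
    std::string name_;
};

static_assert(!std::is_default_constructible_v<Goose>,
              "a Goose can only come out of from_text()");

// Generic entry point, loosely analogous to Rust's str::parse::<T>():
// any type that provides a static from_text() can be parsed this way.
template <typename T>
std::optional<T> parse(std::string_view text) {
    return T::from_text(text);
}
```

Construction happens inside the call, so `parse<Goose>("goose:Gertrude")` either produces a Goose or produces nothing, and no placeholder instance is needed beforehand.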

3 days ago

gsliepen

Rust's FromStr only deals with parsing a single object. However, ideally std::scan() would be an exact counterpart of std::print() and would be able to parse multiple objects. I totally agree that the C way of passing references to already existing variables is not great. Ideally you return a tuple of objects, but then it becomes very annoying to specify the types. Maybe something like this?

    auto [value, text, goose] = std::scan<int, std::string, Goose>(input, "{} {} {}");

A halfway solution would be to have the hypothetical std::scan() take references to std::optional<>s or std::expected<>s:

    std::optional<int> value;
    std::optional<std::string> text;
    std::optional<Goose> goose;
    /* auto result = */ std::scan(input, "{} {} {}", value, text, goose);

The latter would be type safe, close to how scanf() works, but less satisfying from a functional programming standpoint.

Orthogonal to that, adding support for scanning a Goose would be just like how you add a formatter for it, and would be quite similar to a Rust trait. One could imagine having to define something like this:

    template<>
    struct std::scanner<Goose> {
        constexpr auto parse(std::format_parse_context& ctx) {…}
        auto scan(std::format_context& ctx) const -> std::optional<Goose> {…}
    };
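The tuple-returning idea can be prototyped in C++17 today, at least in a much reduced form. The sketch below is not a proposal for `std::scan`: `Scanner`, `scan_words`, and `split_on_spaces` are hypothetical names, there is no format string, and fields are simply separated by spaces.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical per-type hook, loosely analogous to specializing std::formatter.
template <typename T>
struct Scanner;

template <>
struct Scanner<int> {
    static std::optional<int> scan(std::string_view token) {
        int value = 0;
        const char* last = token.data() + token.size();
        const auto [ptr, ec] = std::from_chars(token.data(), last, value);
        if (ec != std::errc{} || ptr != last) return std::nullopt;  // reject "42x"
        return value;
    }
};

template <>
struct Scanner<std::string> {
    static std::optional<std::string> scan(std::string_view token) {
        return std::string(token);
    }
};

// Splits on spaces only; a real design would interpret a format string.
inline std::vector<std::string_view> split_on_spaces(std::string_view input) {
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t start = input.find_first_not_of(' ', pos);
        if (start == std::string_view::npos) break;
        std::size_t end = input.find(' ', start);
        if (end == std::string_view::npos) end = input.size();
        tokens.push_back(input.substr(start, end - start));
        pos = end;
    }
    return tokens;
}

template <typename Tuple, std::size_t... Is>
std::optional<Tuple> scan_impl(const std::vector<std::string_view>& tokens,
                               std::index_sequence<Is...>) {
    // Each field is parsed into an optional first, so no field type needs
    // a default constructor.
    std::tuple<std::optional<std::tuple_element_t<Is, Tuple>>...> parts{
        Scanner<std::tuple_element_t<Is, Tuple>>::scan(tokens[Is])...};
    if (!(std::get<Is>(parts).has_value() && ...)) return std::nullopt;
    return Tuple{std::move(*std::get<Is>(parts))...};
}

// Returns every field or nothing; the field types are given explicitly.
template <typename... Ts>
std::optional<std::tuple<Ts...>> scan_words(std::string_view input) {
    const auto tokens = split_on_spaces(input);
    if (tokens.size() != sizeof...(Ts)) return std::nullopt;
    return scan_impl<std::tuple<Ts...>>(tokens, std::index_sequence_for<Ts...>{});
}
```

Usage would look like `if (auto r = scan_words<int, std::string>("42 honk")) { auto [value, text] = *r; }`. A type such as Goose would plug in through its own `Scanner` specialization, without needing a default constructor.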

3 days ago

MarsIronPI

Heh, I can especially tell the first code example is LLM-generated. Humans don't usually write comments like:

   // There are a few ways to let API callers bring their own 
   // memory, as they would in a no-malloc environment and this
   // stack-friendly c'tor is a stand-in for that. 

There's just something about this comment that doesn't feel right. I've seen these kinds of phrasings in LLM output before but I'm not sure exactly how to describe them.

3 days ago

mayoff

The second sentence of your summary is fine, but I don’t like the first sentence:

> Use your language’s type system to parse unstructured inputs.

We don’t use the type system to parse. We use the type system to provide evidence (also called a proof or a witness) that parsing was successful, and we rely on the language’s access control facilities (public/private) and the soundness of its type system to prevent fabrication of false evidence.

3 days ago

usefulcat

I don't see how this is in any way preferable to having an ordinary default constructor that does the same thing:

    // There are a few ways to let API callers bring their own 
    // memory, as they would in a no-malloc environment and this
    // stack-friendly c'tor is a stand-in for that. 
    static Birthdate epoch() { return Birthdate(1900, 1, 1); }

3 days ago

plorkyeran

Some readers will expect Birthdate() to be equivalent to Birthdate(0, 0, 0), and naming it Birthdate::epoch() makes it clear that it is not that. I don't think it's worth it, but there is an upside.

3 days ago

bregma

Author has used LLMs to generate Java code in C++. It detracts from his point.

4 days ago

pjmlp

What Java code?

Regardless of how they might have used LLMs, I tend to have an issue with this kind of complaint, given the C++ example code on the Design Patterns: Elements of Reusable Object-Oriented Software book, released in 1994, 2 years before Java was made public.

Or the examples from "Using the Booch Method: A Rational Approach" or "Designing Object Oriented C++ Applications Using The Booch Method".

Additionally, there are enough framework examples, starting with Turbo Vision in 1990, MacAPP in 1989, OWL in 1991, MFC in 1992, ...

Somehow a C++ style that was prevalent in the industry between 1990 and 1996, that I bet plenty of devs still have to maintain in 2026, has become "Java in C++".

3 days ago

bregma

> What Java code?

A class with a passel of static member functions is Java code. It is not in any way idiomatic C++ code which has had namespace-level ("free") functions since it was invented as C-with-classes many decades ago. Using classes holding a whole lot of static member functions is strongly frowned on in the professional C++ community.

2 days ago

pjmlp

Certainly not the professional C++ community that still uses frameworks born in the 1990s predating Java, or game engines.

2 days ago

antonvs

> Somehow

There's not much mystery about that - Java took that approach and ran with it, and now has much greater mindshare than C++.

Also, the mid-90s were before most software developers working today were born, I suspect. They'd have to go find a graybeard and ask them to tell them tales of yore, to find out about any of this.

3 days ago

pjmlp

We gladly tell bonfire tales. :)

3 days ago

SuperV1234

No, it doesn't.

3 days ago

jsymolon

First thought, assuming that birth year starts at 1900 is bad for a number of reasons; one of which, "process this list of authors and ..."

What about everyone born before 1900?

4 days ago

alpinisme

It’s a contrived example. And I have to assume the author intended it to be contrived given that he also put an upper bound at 1999 in an article written in 2026 in an industry that skews young.

But the pattern applies regardless of the validation logic.

4 days ago

psychoslave

Assuming that the birth year is necessarily known for everyone assumed to have existed is already a big hypothesis, if we go in that direction.

3 days ago

Neywiny

Or what if they were born after 1999?

It's just a toy example not a production ready birthday validation library.

4 days ago

blt

I'm not a Haskell programmer, but from my limited awareness: Wouldn't they want to encode the restriction that April 31 doesn't exist directly in the type system instead of using raw integers for the underlying struct?

3 days ago

kstenerud

C is perfectly capable of type-driven design. He's already got the type (struct), and although C is a bit limited, he can:

* return pointer-or-null

* choose "invalid" sentinel values and then use birthdate_is_valid(...) to check validity.

* Add an is_valid bool field (or even an error enum like in the C++23 example)

* Add an out field in the constructor function for the error code (similar to how ObjC does things).

3 days ago

wk_end

The point of parse-don't-validate is that the type checker prevents you from having a value of a particular type that's invalid.

Pointer-or-NULL doesn't work, because all pointers are nullable in C; you can always have a Foo* (NULL) that doesn't actually point to a valid Foo.

Invalid sentinel values are definitionally values of a particular type that are invalid. Same with an is_valid field.

An out field in the constructor means that whatever you actually return in the case of an error is going to be a well-typed Foo that's invalid.

3 days ago

kstenerud

My point is that you do the checking at the call site, and then use a static analysis tool or an AI to enforce checking the result right after calling parse_birthday.

Sure, Optional is more elegant, but the end result is the same: Now none of the other code needs to validate; it's already been verified valid at all points where a parse error could have occurred.

C may not be an easy language, but with the right tooling you can make code safer, and idioms like parse-dont-validate possible.

3 days ago

mrkeen

Cool, incredibly low bar.

All four of your examples are validate.

Know any languages that are worse than C at this?

3 days ago

tech_hutch

Or use an out field for the type itself, and use the return value for an error code (or just a bool). A common pattern in C#.
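A sketch of that out-parameter shape, written here in C++ rather than C or C#. `try_make_birthdate` is a hypothetical name and the 1900-1999 range is the toy range discussed in this thread.

```cpp
// Plain aggregate: anyone can create one, valid or not.
struct Birthdate {
    unsigned year;
    unsigned month;
    unsigned day;
};

constexpr bool is_leap_year(unsigned y) {
    return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

constexpr unsigned days_in_month(unsigned y, unsigned m) {
    constexpr unsigned days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    return (m == 2 && is_leap_year(y)) ? 29u : days[m - 1];
}

// Returns true and writes `out` only on success; `out` is left untouched on
// failure. [[nodiscard]] asks the compiler to warn when the result is ignored.
[[nodiscard]] bool try_make_birthdate(unsigned y, unsigned m, unsigned d, Birthdate& out) {
    if (y < 1900 || y > 1999) return false;  // assumed toy range
    if (m < 1 || m > 12) return false;
    if (d < 1 || d > days_in_month(y, m)) return false;
    out = Birthdate{y, m, d};
    return true;
}
```

The caveat raised earlier in the thread still applies: the caller has to create a `Birthdate` before the call, so the type by itself does not prove that the value went through the check.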

3 days ago

rienbdj

C++ could use some do-notation

4 days ago

marcosdumay

Abstracting any part of code structure in C++ is a wasps' nest that will attack you back.

3 days ago

lstodd

Did you mean "abstract you back"?

Being abstracted by code you just wrote is quite a painful experience, yes.

3 days ago

actionfromafar

Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming? Like parsing and validating blurs into each other.

3 days ago

LittleLily

In my experience it makes even more sense in functional programming languages, not less, since they usually also have more powerful type systems that help with actually representing parsed vs unparsed data.

3 days ago

gspr

> Disregarding the article for a second, has anyone else had the pattern that "parse don't validate" makes sense in object oriented style, but less sense in functional style programming?

Parse, don't validate was written around Haskell!

3 days ago

actionfromafar

What I tried and apparently failed to express with "parsing and validating blurs into each other." was that parsing more easily becomes "just what you do" in functional style of programming. To the point that nowadays I can no longer really remember what I did back when I tried to "validate" things instead of parsing them.

3 days ago

andrepd

The tl;dr is that instead of representing emails as type String and manually sprinkling is_email(str) throughout your code, you represent as type Email, which has a function parse(String) -> Option<Email>. The type system then ensures the checks are present whenever they have to be, and nowhere else.

This is extremely natural to do in a language like Haskell or Rust. And incredibly unnatural to do in C++ for instance.

3 days ago

short_sells_poo

I hope this is not trolling so I'll bite. It is incredibly natural to represent an object, such as an email, as an Email class in object oriented languages like C++. It'd then have a constructor that accepts a string and constructs the email object from said string, or maybe a parse(string) -> Option<Email> thingy. The type system then ensures the checks are present whenever they have to be, and nowhere else.

Tl;dr: there's nothing extra that functional or OO programming give you here. Both allow you to represent the problem in a properly typed fashion. Why would you represent an email as a string unless you are a) deeply inexperienced or b) have some really good reason to drop all the benefits of a strongly typed language?
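The `Email` shape discussed in the last two comments could look roughly like this in C++17. It is a sketch under loud assumptions: the validity rule (exactly one `@` with text on both sides) is deliberately simplistic and is not real address validation, and `domain_of` is a hypothetical downstream function.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>

class Email {
public:
    // Toy rule for illustration only: exactly one '@', with text on both sides.
    static std::optional<Email> parse(std::string_view text) {
        const auto at = text.find('@');
        if (at == std::string_view::npos || at == 0 || at + 1 == text.size()) {
            return std::nullopt;
        }
        if (text.find('@', at + 1) != std::string_view::npos) return std::nullopt;
        return Email(std::string(text));
    }

    const std::string& str() const { return value_; }

private:
    explicit Email(std::string value) : value_(std::move(value)) {}
    std::string value_;
};

// Downstream code takes an Email, not a std::string, so it cannot be called
// with unchecked text and does not need its own is_email() test.
inline std::string domain_of(const Email& email) {
    const std::string& s = email.str();
    return s.substr(s.find('@') + 1);
}
```

The check lives in exactly one place, `Email::parse`, and every function that takes an `Email` relies on it through the type.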

3 days ago

bananaboy

I completely agree with you but I think sometimes folks carry some piece of data around as a string or int instead of something more concrete like a class or a strongly typed enum etc purely out of laziness!

3 days ago

MarsIronPI

I think the old Lisp tradition of using lists for everything is related to this somehow. On the other hand, in Common Lisp programmers can define custom types that have to fulfill a predicate function. Then, if they declare the types of their functions, most implementations will generate type-checking code unless instructed not to. So in Common Lisp you can use lists for everything but still have type-checking, at some cost to efficiency. :D

3 days ago

leodavi

Well, in C++ the constructor must return a value of its class type - you can't return an Option<T> from a constructor on T, for example, and since constructors are the canonical way to construct an object, it creates stylistic and idiomatic friction when you start using free functions to create a Maybe<T> instead of constructors.

3 days ago
