September 22, 2023

Experimenting with Clojure: Advent of Code 2022 - Day 7

  1. Constructing the Filesystem
  2. Calculating Directory Sizes
  3. Filter and Sum the Directories
  4. Final Remarks

It should go without saying, but spoilers for Advent of Code 2022 Day 7. And while I'm spoiling things up front, the namespaces I used to tackle the problem are clojure.zip and clojure.walk. Read on to see how!1

The puzzle I'll be covering in this article is as follows:

Given some terminal output that show the results of navigating a filesystem, find all the directories with a total size of at most 100 000 and sum the sizes of those directories. The given sample terminal output (puzzle input) is:

$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k

There's probably some clever way to handle this, but my first thought was to break the problem up into very simple steps and not worry too much about algorithmic efficiency. I figured in order to solve this puzzle, I would have to:

  1. Parse the given input into a filesystem tree
  2. Calculate the size of each directory in the tree
  3. Walk the tree and filter the directory nodes with size of at most 100 000
  4. Sum the sizes of the filtered directory nodes

Constructing the Filesystem

It seemed to me that step 1 would require some interesting recursion with backtracking, where I would have to keep track of the tree so far so that cd .. would take me to the right place. I didn't find the thought of doing that particularly exciting, but then I remembered an interesting discovery I made a while ago while browsing Clojure API docs: a namespace called clojure.zip.

The documentation for this little namespace is even more terse than some of the tersest stuff in clojure.core, but it was intriguing enough to me at the time that I'd read some articles and watched some videos about it; I knew from prior exploration that this namespace provides a "functional zipper." A zipper is essentially a cursor into a hierarchical data structure that can be moved around to point to different nodes (e.g. up, down, right) and make changes to the structure at the cursor (e.g. append-child to append a new node at the cursor). That seemed right for the problem of parsing the given input into a filesystem tree, as it does all the heavy lifting of traversing children, appending filesystem nodes, and navigating back up the tree in response to cd ...

I'll walk through the code piece by piece below, but here's all the parsing code up front for reference (note that in the following code, clojure.zip is aliased as z):

(defn dir [name]
  {:name name :children []})

(defn file [name size]
  {:name name :size (parse-long size)})

(defn dir? [file]
  (contains? file :children))

(defn fs-zipper [root]
  (z/zipper dir? :children (fn [n coll] (assoc n :children coll)) root))

(defn insert-dir [loc name]
  (z/append-child loc (dir name)))

(defn insert-file [loc name size]
  (z/append-child loc (file name size)))

(defn dir-loc [name loc]
  (let [file (z/node loc)]
    (when (and (dir? file) (= (:name file) name))
      loc)))

(defn nav-to-dir [loc name]
  (some #(dir-loc name %) (->> loc z/down (iterate z/right))))

(defn parse-command
  [loc input]
  (condp re-find input
    #"\$ cd /" (fs-zipper (dir "/"))
    #"\$ ls" loc
    #"dir (.+)" :>> (fn [[_ name]] (insert-dir loc name))
    #"(\d+) (.+)" :>> (fn [[_ size name]] (insert-file loc name size))
    #"cd \.\." (z/up loc)
    #"cd (.+)" :>> (fn [[_ name]] (nav-to-dir loc name))))

The workhorse of the above code is parse-command, and it's intended to be used as a reducing function to build up the filesystem tree using a zipper (the terminology used by the zipper functions to indicate the current location you're pointing to in the tree is "loc"). Once the tree has been build up with the zipper, we can "unwrap" the data structure by calling z/root on it. Parsing the sample input into a filesystem tree looks like this:

(defn parse [input]
  (->> input
       str/split-lines
       (reduce parse-command nil)
       z/root))
       
(parse sample-input)

So what does parse-command do?

Given a loc and line of input, it matches one of the six possibilities and returns the resulting loc. The possibilities are:

  1. $ cd / creates a new zipper with it's starting node as the directory /. I probably could (should?) have done this outside of parse-command since it only happens exactly once, but I liked the uniformity of having all the possible commands parsed and appearing here in this one place.
  2. $ ls is a no-op and returns the loc unchanged.
  3. dir (.+) matches the listing of a directory and captures the directory name. condp has a neat little feature where you can have a clause take the form test-expr :>> result-fn instead of the typical test-expr result-expr. With this form, result-fn is meant to be a function that takes the result of the predicate: in this case, that means that the function will be called with the result of (re-find #"dir (.+)" input) as its input. The result of re-find is a vector with its first element being the entire match, and each subsequent element being a capture group. For example, if the input were "dir a", then the result of re-find would be ["dir a" "a"], and so our anonymous result-fn is simply destructuring this vector, calling (insert-dir loc "a") and returning the result.
  4. (\d+) (.+) is similar to the above but matches the listing of a file and captures the file size and name. For example, if the input were "14848514 b.txt", then we would end up calling (insert-file loc "b.txt" "14848514") and returning the result.
  5. cd .. returns the result of navigating "up" from the current loc. This was the part that really motivated me to try out zippers: it's just so clean and maps one-to-one with the problem statement, no extra gymnastics on my part to keep track of things.
  6. cd (.+) takes care of navigation down the tree into subdirectories. It used the same condp feature as the directory and file cases. For example, if the input were "cd a", then we would end up calling (nav-to-dir loc "a") and returning the result.

Next, let's take a closer look at the functions that manipulate our tree. We'll start with creating the zipper:

(defn fs-zipper [root]
  (z/zipper dir? :children (fn [n coll] (assoc n :children coll)) root))

In order to create a zipper, we need to provide a few things:

  1. A predicate that determines whether a node is a branch (in our case, filesystem directories are branches)
  2. A function to retrieve the child nodes of a branch
  3. A function to set the child nodes of a given node to the given collection
  4. The root node of the zipper (the root directory in our case)

Inserting directories and files is fairly straightforward:

(defn insert-dir [loc name]
  (z/append-child loc (dir name)))

(defn insert-file [loc name size]
  (z/append-child loc (file name size)))

In both cases, given a loc and the regex match from parse-command, we destructure the match to get the relevant bits and append the appropriate node. Navigating to a given directory is a little more involved, but not by much:

(defn dir-loc [name loc]
  (let [file (z/node loc)]
    (when (and (dir? file) (= (:name file) name))
      loc)))

(defn nav-to-dir [loc name]
  (some #(dir-loc name %) (->> loc z/down (iterate z/right))))

dir-loc returns the given loc if it represents a directory with the given name, or nil otherwise. This allows the function to play nice with some in nav-to-dir, which returns the first truthy value. So what exactly is nav-to-dir doing? First, it creates a lazy seq of all the children of the node at loc: z/down navigates to the first child and (iterate z/right) iterates through them all. Then it calls some on that seq, looking for the first node that is a directory with the given name.

For clarity, let's look at the filesystem tree generated by parsing the sample input. For convenience, here's the sample input again:

$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k

This gives the following data structure:

{:name "/"
 :children [{:name "a"
             :children [{:name "e"
                         :children [{:name "i" :size 584}]}
                        {:name "f" :size 29116}
                        {:name "g" :size 2557}
                        {:name "h.lst" :size 62596}]}
            {:name "b.txt" :size 14848514}
            {:name "c.dat" :size 8504156}
            {:name "d"
             :children [{:name "j" :size 4060174}
                        {:name "d.log" :size 8033020}
                        {:name "d.ext" :size 5626152}
                        {:name "k" :size 7214296}]}]

Calculating Directory Sizes

Now that we have the filesystem represented, the next step is to determine the sizes of the directories. In a language like Java, this is fairly straight forward: perform a post-order depth-first traversal of the tree and modify each directory node to have a file size being the sum of its child nodes. In Clojure, I wasn't exactly sure how to approach this, because I'd need to produce a new tree (or at least swap in new nodes) containing directories with sizes. As with the first part of the puzzle, I smelled recursion that I wasn't particularly excited to deal with. Thanks to clojure.walk, I didn't have to!

There's a postwalk function in that namespace that just figures out how to traverse the data structure you give it, provided it's tree-shaped in some way. You also give it a function to transform a node, and it will run this function on each node, swapping the result in to the data structure. Here's what that looks like:

(defn assoc-size [file]
  (if-let [children (:children file)]
    (let [dir-size (->> children (map :size) (apply +))]
      (assoc file :size dir-size))
    file))

(-> sample-input
    parse
    (walk/postwalk assoc-size))

After parsing the sample input into a filesystem tree, we finish up with a postwalk to calculate the directory sizes. As mentioned above, postwalk takes a function that transforms a node: assoc-size. This function does pretty much what you would expect: if there are children (i.e. this node is a directory), assoc in the sum of the sizes of all the children; otherwise, leave the node unchanged, since it already has a size specified.

Filter and Sum the Directories

There's one interesting bit left before we coast across the finish line: how do we get Clojure to traverse the filesystem structure we've built? It turns out there's a function for that too! Since our structure is representing a tree, we can adapt it to Clojure's sequence abstraction by calling tree-seq on it. tree-seq works similarly to constructing a zipper where we have to tell it how to identify branch nodes and how to get child nodes from a branch node. In our case that's simply:

(tree-seq dir? :children filesystem)

With a tree-seq in hand, we're good to use some familiar functions to perform the remaining calculations:

(->> sample-input
     parse
     (walk/postwalk assoc-size)
     (tree-seq dir? :children)
     (filter (every-pred dir? #(<= (:size %) 100000)))
     (map :size)
     (apply +))
;; => 95437

Final Remarks

I'm pretty happy with how this one turned out. It took a while to figure out how to use these new namespaces, but I'm sure the amount of frustration they saved over doing the recursion myself was well worth it. In general, this one really hammered home something I remember from the Functional Design in Clojure podcast2 (paraphrasing): your job is to teach Clojure your domain language by writing predicates, reducing functions, etc. and then leveraging the excellent core library to do the actual work. If you look back at the code I had to write, most of it is exactly that: I define what directories and files are, how to make a filesystem zipper, what it means to insert a node in the filesystem, and how to transform a directory to include a size. The rest is pretty much just gluing existing Clojure functions together (and parsing strings, but that's also pretty domain-specific). The Clojure standard library is a gold mine of useful functions once you figure out how to leverage them properly: they're quite expressive, compose nicely, and have some neat ergonomic choices, such as condp's result-fn alternative.3

It also reenforced how important it is to get your data in the right shape for the problem you're trying to solve. Once the filesystem was built up and populated with file sizes for directories, the rest of the work was pretty simple. It's classic divide-and-conquer: take the problem and break it into two. The first solves the problem assuming the data is in an easy-to-use shape; the second parses the input into that easy-to-use shape.

I had a lot of fun on this one, and I'll be keeping my eyes open for other treasures in the standard library!


  1. You can find all the code here, including my solution for part 2 of the puzzle. I didn't bother covering it above because it was pretty trivial by comparison and followed easily from part 1.

  2. https://clojuredesign.club/

  3. Another interesting one is clojure.string/replace which provides a similar function alternative: (str/replace "my blog title" #"\b(\w)" (fn [[_ letter]] (str/upper-case letter))) ;; => "My Blog Title"

Tags: Clojure