Namespaces and Slurping JSON Data

Jun 29, 2020

Expectations
What I Learned Today
4Clojure Spree
Takeaways

Expectations

My overarching goal for Week 2 is to have a healthier learning schedule for myself. During Week 1, I was up at odd times finishing up my updates. Despite the progress I made, it felt unsustainable in the long term. Besides that, I also want to try building some small tools and utilities with Clojure to get experience working on real world problems rather than contrived exercises.

For today, I felt inspiration to work on 4Clojure problems after working on chapter 6. I also wanted to try to finish Chapter 7 (Namespaces, Test and Data) since it seemed like important practical knowledge that would have immediate benefits.

What I Learned Today

Chapter 7 was a very practical chapter dealing with code organization through namespaces, testing and working with JSON data.

Namespacing

Anyime we create a project with Lein, we are given a stub namespace with a single function. Namespaces always include clojure.core automatically so we can access to special forms, std lib functions and macros via their short names.

We can access a function in a different namespace by first requiring the namespace in the current ns defn and then using it’s fully qualified name:<namespace name>/<function name>. However, using the Fully Qualified name is annoying if we need to use it often. We can solve this a few ways:

Alias a Namespace - (ns user (:require [scratch.utils :as ut]))
Refer to a specific function or var from another namespace by it’s short name - (ns user (:require [scratch.utils :refer [foo]]))
Suck in every function from another namespace with :refer :all - (ns user (:require [scratch.utils :refer :all])). Doing this will allow us to use all functions from scratch.utils by invoking it’s short name. This can however cause confusion and potential namespace pollution.

Using Namespaces allows us to control the complexity of our projects by isolating code into related and understandable pieces.

External Deps

We can find open source libraries on Clojars. Once we’ve found a library and the version number we need, all we have to do is include it under :dependencies in project.clj.

Parsing JSON Data

We use the Cheshire library for parsing JSON strings and creating Clojure data structures from it. We can combine this with the slurp function to read a JSON file from disk and parse it. This would look like -

(defn load-json
  "Given a filename, reads a JSON file and returns it, parsed, with keywords"
  [file]
  (json/parse-string (slurp file) true)) ;; the true arg to parse-string coerces keys into keywords

Once we’ve parsed the JSON into a list or vector, we can work it on like any normal Clojure sequence and apply our usual functions like map, sort, filter etc. This makes working with even large data sets easy and efficient since we get powerful processing functions mixed with lazy evaluation. For eg:

(defn most-duis
  "Given a JSON filename of UCR Crime data for a particular year, finds the counties with most DUIs"
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map (fn [county]
              [(fips (fips-code county))
               (:driving_under_influence county)]))
       (into {})))

Chapter 7 Exercises:

most-duis tells us about the raw number of reports, but doesn’t account for differences in county population. One would naturally expect counties with more people to have more crime! Divide the :driving_under_influence of each county by its :county_population to find a prevalence of DUIs, and take the top ten counties based on prevalence. How should you handle counties with a population of zero?

(defn most-duis-per-capita
  "Given a JSON filename of UCR Crime data for a particular year, finds the counties with most DUIs weighted by Population"
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (remove (fn [county] (= (:county_population county) 0)))
       (map (fn [county]
              (let [dui  (:driving_under_influence county)
                    popl (:county_population county)
                    prev (* (float (/ dui popl)) 100)]
                {:county     (fips (fips-code county))
                 :dui        dui
                 :pop        popl
                 :prevalence (str (format "%.2f" prev) "%")})))
       (sort-by :prevalence)
       (take-last 10)))

How do the prevalence counties compare to the original counties? Expand most-duis to return vectors of [county-name, prevalence, report-count, population] What are the populations of the high-prevalence counties? Why do you suppose the data looks this way? If you were leading a public-health campaign to reduce drunk driving, would you target your intervention based on report count or prevalence? Why?

({:county "NC, Graham", :dui 192, :pop 7861, :prevalence "2.44%"}
 {:county "CA, Inyo", :dui 424, :pop 17306, :prevalence "2.45%"}
 {:county "NC, Ashe", :dui 655, :pop 25737, :prevalence "2.54%"}
 {:county "CO, Costilla", :dui 85, :pop 3269, :prevalence "2.60%"}
 {:county "CO, Conejos", :dui 210, :pop 8039, :prevalence "2.61%"}
 {:county "CO, Cheyenne", :dui 45, :pop 1714, :prevalence "2.63%"}
 {:county "TX, Kenedy", :dui 11, :pop 391, :prevalence "2.81%"}
 {:county "VA, Norton", :dui 118, :pop 3691, :prevalence "3.20%"}
 {:county "MS, Tunica", :dui 432, :pop 10650, :prevalence "4.06%"}
 {:county "WI, Menominee", :dui 189, :pop 4617, :prevalence "4.09%"})

This data is interesting as we can clearly see that counties with the highest prevalence rates also tend to have very low populations.

We can generalize the most-duis function to handle any type of crime. Write a function most-prevalent which takes a file and a field name, like :arson, and finds the counties where that field is most often reported, per capita.

(defn most-prevalent
 "Given a JSON filename of UCR Crime data for a particular year, finds the counties with most crimes weighted by Population"
 [file crime top]
 (->> file
      load-json
      (remove (fn [county] (= (:county_population county) 0)))
      (map (fn [county]
             (let [crime_cnt (crime county)
                   popl      (:county_population county)
                   prev      (* (float (/ crime_cnt popl)) 100)]
               {:county     (fips (fips-code county))
                :rawcrime   crime_cnt
                :pop        popl
                :prevalence (str (format "%.2f" prev) "%")})))
      (sort-by :prevalence)
      (take-last top)
      (reverse)))

We simply pass the whatever crime keyword we want into the function to get the prevalence rates for that crime type. For example, (most-prevalent "2008.json" :property_crimes 10) yields:

"VA, Franklin",
 :rawcrime 155,
 :pop 8955,
 :prevalence "1.73%"}
{:county "NC, Halifax",
 :rawcrime 940,
 :pop 54932,
 :prevalence "1.71%"}
{:county "VA, Roanoke",
 :rawcrime 1533,
 :pop 91983,
 :prevalence "1.67%"}
{:county "VA, Fredericksburg",
 :rawcrime 364,
 :pop 22740,
 :prevalence "1.60%"}
{:county "GA, Crisp", :rawcrime 352, :pop 22035, :prevalence "1.60%"}
{:county "AK, Prince of Wales-Outer Ketchikan",
 :rawcrime 30,
 :pop 1883,
 :prevalence "1.59%"}
{:county "NC, Robeson",
 :rawcrime 1942,
 :pop 129322,
 :prevalence "1.50%"})

4Clojure Spree

I went on a 4Clojure Problem solving spree today. I was able to solve roughly 20 Easy problems today, bringing my total tally to 66. Most of these problems felt pretty easy to do. Compared to last week when I struggled with problems of similar difficulty, today felt like I’d really leveled up my intuition on how to approach the problems.

The partition-by function was particularly useful in solving many of these problems. Any time a problem is related to working with consecutive duplicates, I reached for (partition-by identity). Shoutout to @nthd3gr33 for showing me this function during last week’s pair programming session.

A quick example - Problem 30. Compress a Sequence. The problem asks you to write a function which removes consecutive duplicates from a sequence.

[1 1 2 3 3 3 2 2 3] -> '(1 2 3 2 3)

This can be solved with

#(map first (partition-by identity %1))

(partition-by identity %1) applies the identity function to each value in the coll and when it returns a new value compared to the previous one, splits the previous values into a subsequence. This allows us to quickly take a sequence looking like this '(1 1 2 3 3 2 2 3) and transforms it into '((1 1) (2) (3 3) (2 2) (3)). (map first) takes the first element from each sub-sequence yielding '(1 2 3 2 3). Wasn’t that easy!?. There really is a Clojure function for everything

Another interesting problem was No. 122 - Read a Binary Number. This problem asks you to convert a binary number, provided as a string, to it’s numerical value.

This was my solution -

(defn bin-to-dec [s]
  (->> s             ;; "1010"
       (reverse)     ;; "0101"
       (seq)         #_ '(\0 1 \0 \1)
       (map str)     #_ '("0" "1" "0" "1")
       (map #(Integer/parseInt %1))  #_ (0 1 0 1)
       (zipmap (take (count s) (iterate (fn [x] (* 2 x)) 1))) #_ {1 0, 2 1, 4 0, 8 1}
       (reduce (fn [acc [k v]] (+ acc (* k v))) 0))) #_ 10

To be perfectly honest, the solution I have feels a little complicated for what seems to be a simple problem. But I’m stll satisfied with how I reasoned through the solution.

Takeaways

Use namespaces for code organization and isolation of code by concern
Tests allow you to verify that our code is correct
We can use open source libraries in our code for additional features and functions
Clojure is very powerful when it comes to data processing!
partition-by is useful for solving problems related to consecutive sequences

Today’s Tally:

20 4Clojure Problems
Chapter 7 (including Exercises)

See you tomorrow!

Overthunk Ltd.A Solitary Software Production