On this page:
1 Theme Song
2 Examplar
3 Planning
4 Problems
4.1 Penguins
4.2 Shelf Sorting
4.3 Missing Authors
4.4 Data Smoothing
4.5 Data Streaming
4.6 Music Sorting

Data Scripting

    1 Theme Song

    2 Examplar

    3 Planning

    4 Problems

      4.1 Penguins

      4.2 Shelf Sorting

      4.3 Missing Authors

      4.4 Data Smoothing

      4.5 Data Streaming

      4.6 Music Sorting

The purpose of this assignment is to get you comfortable with writing clean and correct programs quickly for basic tasks. The tasks are chosen as the kind you would encounter if you are processing data sets in a data science setting; hence the name. We strongly encourage the use of higher-order functions to improve the comprehensibility (and conciseness) of your code, but do not require it (the way we required it in Placement 4).

For this homework, you are expected to do the work entirely on your own (without help from the course staff, whom you may only consult for help with names of library functions or other essential information of that sort). Though we do expect you to be able to do all the problems in this assignment, this is not the same as saying all these problems will be straightforward: we expect you to find some challenging!

Note that some of the problems use special values like 0 or -999 to delimit content. In general, you should not do this in your own programs: use properly structured data rather than embedding the shape in the data. However, sometimes you will confront real-world data sources that do this, and you may not have the freedom to change them. These problems give you a feel for working in such settings.

1 Theme Song

lofi hip hop radio - beats to relax/study by Lofi Girl

2 Examplar

There is no Examplar support for this assignment, but we do have this starter file: Template.

Please put all the solution functions in the same file.

3 Planning

Once again, we’re going to do some planning before implementing. You’ll again use a block-based interface, but with some changes and improvements from the last time you did it. Please read (and follow) these instructions before you start programming the following problems.

4 Problems
4.1 Penguins

An American zoo requests a short-term loan of healthy Gentoo penguins from a Canadian zoo. A healthy male Gentoo penguin weighs between 11lb and 19lb (inclusive), and a healthy female weighs between 9.9lb and 18lb (inclusive).

The Canadian zoo stores all penguin data in the following datatype:

data CanadaPenguin:

  | canada-penguin(name :: String,

      gender :: String,

      weight-in-kg :: Number)

end

You can assume that gender is only ever "male" or "female" for these particular Gentoo penguins.

The Canadians compute which penguins are a healthy weight and send only those penguins to the American zoo. When the Americans return the penguins a month later, the Canadians find that all of the loaned penguins have gained 2lbs due to a rich American diet.

Design a function after-loan:

after-loan :: List<CanadaPenguin> -> List<CanadaPenguin>

which consumes a list of Canadian Gentoo penguins and produces a list of loaned penguins with their new weights (in kgs) after their stay at the American zoo.

We provide helper functions kg-to-lb and lb-to-kg to perform the unit conversions for you. Both consume and produce numbers.

Example:

after-loan(

  [list:

    canada-penguin("Eduardo", "male", 5.44),

    canada-penguin("Daisy", "female", 4.31),

    canada-penguin("Charles", "male", 8.16)])

is

[list:

  canada-penguin("Eduardo", "male", 69976/11025),

  canada-penguin("Charles", "male", 99964/11025)]

4.2 Shelf Sorting

Most bookstores sort the books on their shelves by the author’s last name. Unfortunately, some bookstore patrons do not preserve this order when browsing books, and simply place books back wherever they fit.

Assume that bookstores keep all authors whose last name starts with the same letter on the same shelf, and those shelves are labeled with that letter. A record of which authors are on a given shelf would be represented as:

data Shelf:

  | shelf(letter :: String,

      authors :: List<String>)

end

where letter is a string of length 1 which will only be a lowercase English letter, and authors is a list of authors’ last names. Assume that authors’ last names consist only of lowercase English letters.

Bookstores need to know when a book has been placed on the incorrect shelf. Design this function:

fix-shelves :: List<Shelf> -> List<Shelf>

which consumes a list of Shelf values and produces a list of those Shelf records where there is at least one author who doesn’t belong on that shelf. The output Shelf records should only contain the authors who don’t belong on that shelf. Authors need not be in order in the output; the employees fixing the shelves can put them back in the proper location. Do not generate empty Shelf records; this generates needlessly long reports, which annoys the employees.

Example:

fix-shelves([list:

    shelf("r", [list: "rivest", "sussman", "ronaldo"]),

    shelf("s", [list: "sanderson"]),

    shelf("t", [list: "doyle", "tucker", "talie"])])

is

[list:

  shelf("r", [list: "sussman"]),

  shelf("t", [list: "doyle"])]

4.3 Missing Authors

“The reader does not steal, and the thief does not read.” —folk saying

Even aside from book theft, there are many reasons for someone to want to keep track of which books/authors end up missing from bookstore shelves. Design this function (using the same Shelf datatype above):

missing-authors :: List<Shelf>, List<Shelf> -> List<Shelf>

The first list is what’s on the shelves before, while the second is what’s there afterwards. It produces a list of Shelf values that describes the authors who are present in the first list but missing in the second list.

Assume that both lists of shelves have the same number of Shelf values and that they have the same letters—but not necessarily in the same order.

Example:

missing-authors(

  [list:

        shelf("r", [list: "rivest", "ronaldo"]),

        shelf("s", [list: "sanderson", "sussman", "solar", "stine"]),

        shelf("t", [list: "tucker", "talie", "taylor"])],

  [list:

        shelf("t", [list: "tucker", "talie", "taylor"]),

        shelf("r", [list: "rivest", "ronaldo", "rivera"]),

        shelf("s", [list: "sussman", "solar", "stine"])])

is

[list:

  shelf("r", [list: ]),

  shelf("s", [list: "sanderson"]),

  shelf("t", [list: ])]

4.4 Data Smoothing

For this problem, assume that a personal health record (PHR) contains four pieces of information on a patient: their name, height (in meters), weight (in kilograms), and last recorded heart rate (as beats-per-minute). A doctor’s office maintains a list of the personal health records of all its patients:

data PHR:

  | phr(name :: String,

        height :: Number,

        weight :: Number,

        heart-rate :: Number)

end

You can assume that none of the numeric fields will be zero.

In data analysis, smoothing a data set means approximating it to capture important patterns in the data while omitting noise or other fine-scale structures and phenomena. One simple smoothing technique is to replace each internal element of a sequence of values with the average of that element and its predecessor and successor. Assuming that extreme outlier values are an aberration caused, perhaps, through poor measurement, this averaging process replaces them with a more plausible value in the context of that sequence.

For example, consider this sequence of heart-rate values taken from a list of personal health records:

95 102 98 88 105

The resulting smoothed sequence should be

95 295/3 96 97 105

where:
  • 102 was substituted by 295/3: (95 + 102 + 98) / 3

  • 98 was substituted by 96: (102 + 98 + 88) / 3

  • 88 was substituted by 97: (98 + 88 + 105) / 3

This information can be plotted in a graph such as below, with the smoothed graph superimposed over the original values.

Design a function data-smooth

data-smooth :: List<PHR> -> List<Number>

that consumes a list of PHRs and produces a list of the smoothed heart-rate values (not the entire records).

Example:

As given in the descriptive example above, assuming the initial sequence is instead a list of PHRs with the given values as the heart-rates.

4.5 Data Streaming

When communicating data that must be sent as a stream (a sequence of items sent one at a time), it is difficult to preserve non-linear structure. Instead, the data must be sent with special values that the recipient can recognize from which to reconstruct the original structure.

In this problem, the input is a two-dimensional table of numbers (where the rows are not necessarily the same width). The table is encoded as a list. Each row is terminated by a zero. If you find two consecutive zeros, these terminate the entire table, and any data that come after it should be ignored.

Design a program called table-sum that consumes a list of numbers representing the above kind of table, and produces a list of the sums of each row.

table-sum :: List<Number> -> List<Number>

Example:

table-sum([list: 1, 2, 0, 7, 0, 5, 4, 1, 0, 0, 6])

  is [list: 3, 7, 10]

4.6 Music Sorting

A record store is attempting to determine the full range of years covered by their in-store CD collection. An intern has taken note of the release year of every CD they sell. This list has become unmanageable, so the manager has asked for the list of years to be turned into a list of Segments:

data Segment:

  | solo(yr :: Number)

  | span(start :: Number,

        stop :: Number)

end

A segment can be a singleton, representing a single year yr. The manager wants a singleton when neither x - 1 nor x + 1 is in the list. Otherwise, a segment must be a timespan, representing the (inclusive) range of contiguous years that was in the original list.

Design a function year-summary

year-summary :: List<Number> -> List<Segment>

that takes in a list of years (which may include duplicates) and turns it into a list of Segments that follows the above rules, has the output is chronological order, and has no duplicate or overlapping Segments.

Example:

year-summary([list: 2003, 2004, 2005, 1997, 2003, 2009, 2007, 2010, 2014])

is

[list: solo(1997), span(2003, 2005), solo(2007), span(2009, 2010), solo(2014)]