5  Multi-Dimensional Arrays for Decision Analysis

In this chapter, we explore the xarray package, Python’s premier package for dealing with labelled multi-dimensional arrays. Let’s digest what that means. Labelled simply means we can use English names instead of numbers to identify data elements; it is so much more intuitive to ask the computer to “give the expected profit associated with decision q” than to ask something like “give the \(14^{th}\) element of the \(2^{nd}\) axis,” as would be done using numpy. You might balk and say pandas can have names for rows and columns, and that is true, but what if you need to reference data that requires more than the two coordinates, row and column, that pandas provides? For example, storing weather-related data requires dimensions of “latitude”, “longitude”, and “time”, and for each combination of those three dimensions one might want access to multiple data variables like “temperature”, “precipitation”, and “wind speed”. Trying to put that data inside of pandas would be cumbersome, and putting it into numpy would certainly be less intuitive than using nice English labels like “precipitation”. So, labelled multi-dimensional means that data elements can be indexed by as many labels as makes sense, and xarray makes this beautifully elegant.

In this chapter, we will start exploring the xarray package by increasing the complexity of our study of the newsvendor problem. Let’s start our journey with Example 5.1.

Example 5.1 We again revisit a newsvendor with a single opportunity to buy newspapers in the morning and then to sell those newspapers on a city corner throughout the day. Our newsvendor’s objective is expressed in two outcomes of interest, namely profit and lost sales, for which he seeks a good combo of high profit and low levels of lost sales; customers who do not find the newspaper are more likely to switch to a different newsvendor. Newspapers are purchased for $1 and sold for $3. Papers remaining at the end of the day are worthless. The newsvendor faces demand that is binomially distributed with parameters \(n=200\) and \(p=0.2\). Let’s help the newsvendor make a good decision by effectively visualizing possible decisions and the associated outcome distributions for profit and lost sales.

The graphical model of Figure 5.1 shows the explicit dependence of our two outcomes of interest on uncertainty about demand and our decision on how much to order.

Figure 5.1: The newsvendor has two outcomes of interest, profit and lost sales.

Around the nodes, notice the new addition of rectangles, technically called plates. When we see any node presented in a graphical model, we should be thinking of a single realization of that random variable. Inclusion of nodes within a plate represents the fact that our model replicates those nodes once per index of the plate. In Figure 5.1 for example, demand \(d\) has one realization per each of the 100 simulations; \(d_i\) is the demand of the \(i^{th}\) simulation and thus, \(d_{20}\) would be the random demand of the \(20^{th}\) simulation. The statistical model accompanying Figure 5.1 is shown here:

\[ \begin{aligned} i &\equiv \textrm{Index for simulation draws. } i \in \{1, 2, \ldots, 100\}\\ d_i &\equiv \textrm{Daily demand for newspapers, and}\\ d_i &\sim \textrm{Binomial}(n=200,p=0.2).\\ q &\equiv \textrm{Order quantity chosen by decision-maker, where}\\ q &\in \{25, 26, 27, \ldots, 50\} \qquad \textit{ (potentially good order qtys.)}.\\ \pi &\equiv \textrm{Daily profit is revenue minus expenses.}\\ \pi(d_i,q) &= 3 \times \min(d_i,q) - 1 \times q \qquad \textit{ (cannot sell more than ordered)}.\\ \ell &\equiv \textrm{Lost sales. Unmet demand due to being out of stock.}\\ \ell(d_i,q) &= \max(0, d_i-q) \qquad \textit{ (lost sales cannot be negative)}. \end{aligned} \]

To put the graphical model into code, we previously used a dataframe to store a representative sample of demand and the associated outcomes of interest for a decision. We can repeat that here for just one order quantity, say \(q = 42\):

from numpy.random import default_rng
import numpy as np
import pandas as pd

potentialCusts = 200
purchaseProb = 0.2

rng = default_rng(seed = 111) 
numSims = 100

# create data frame to store simulated demand
newsDF = pd.DataFrame({"simNum": range(1, numSims+1),  # sequence of 1 to 100
                   "demand": rng.binomial(n = potentialCusts, 
                                          p = purchaseProb,
                                          size = numSims)})

## google SEARCH PHRASE: get element-wise minimum of two columns in pandas dataframe
newsDF["profit_q42"] = 3 * np.minimum(newsDF.demand,42) - 1 * 42
newsDF["lostSales_q42"] = np.maximum(0,newsDF.demand - 42)

# view first 5 rows of newsDF
newsDF.iloc[:5,:]
   simNum  demand  profit_q42  lostSales_q42
0       1      42          84              0
1       2      47          84              5
2       3      43          84              1
3       4      41          81              0
4       5      38          72              0

But now, imagine we want to consider all integer order quantities ranging from 25 to 50; that is 26 different order quantities. It will feel cumbersome to add 52 \({\scriptstyle (26 \times 2)}\) columns, one for each order quantity’s profit and one for each quantity’s lost sales, into a dataframe. There must be a better way to structure how we store this data. More naturally, we can imagine having 26 dataframes, each one specific to a particular order quantity. However, the problem there is we either repeat the simNum and demand columns in each of the 26 dataframes, or we omit those columns from the 26 dataframes and somehow know the correspondence of rows in the profit/lost-sales dataframes to rows in the simNum/demand dataframe. All of this just does not feel right; we are repeating data or hiding knowledge about the data’s structure. The answer for us is in a package called xarray that specializes in handling this type of multi-dimensional labelled data problem.

5.1 The xarray Package

Our goal is to elegantly match our data storage structure to the dimensions of the problem and to increase our ability to work with lots of data using human-interpretable labels. We will do that using the two main data structures from the xarray package.

It might take your brain a little extra work to use this package, but in the end you will be more resilient for it; embrace a little productive struggle here, it’s worth it.

The first structure is a DataArray. In its simplest 1-dimensional form, a DataArray is just a collection of values, like a column of a dataframe (pandas.Series) or a one-dimensional array of values (numpy.ndarray). We can create a simple DataArray using its constructor function. The DataArray created here is a container for a 100-element representative sample of demand corresponding to the narrative of Example 5.1:

from numpy.random import default_rng
import numpy as np
import xarray as xr

rng = default_rng(seed = 111)  ## set random seed 
demand = rng.binomial(n=200,p=0.2,size=100)   ## get demand values

## make data array
xr.DataArray(data = demand)
<xarray.DataArray (dim_0: 100)>
array([42, 47, 43, 41, 38, 40, 41, 45, 29, 29, 49, 44, 41, 46, 40, 39, 37,
       41, 37, 37, 38, 28, 28, 39, 48, 38, 47, 41, 43, 37, 40, 41, 35, 44,
       41, 35, 35, 39, 35, 47, 45, 35, 30, 48, 44, 38, 35, 40, 44, 41, 38,
       47, 42, 35, 36, 36, 44, 37, 42, 41, 40, 36, 36, 43, 43, 48, 27, 44,
       38, 37, 32, 48, 49, 29, 37, 44, 46, 42, 33, 36, 49, 33, 46, 35, 36,
       34, 45, 36, 41, 44, 35, 36, 38, 31, 50, 47, 43, 47, 31, 37],
      dtype=int64)
Dimensions without coordinates: dim_0

Above, we see a 100-element array containing our simulated demand values, but we also see some other things, notably a dim_0: 100 and placeholders for coordinates and attributes. Let’s tackle these one by one. First, the dimension key-value pair dim_0: 100 tells us that the cardinality of our demand array is 100 (i.e. there are 100 values) and the name of the index or coordinate-axis is dim_0. In numpy, axes are indexed using numbers like \(0, 1, 2, \ldots\), etc., but when using xarray we will always want to give our indexes more meaningful and human-interpretable names. When it comes to drawing random samples, we logically name the demand array’s dimension as draw and implement this in code using the dims argument:

## make data array with labelled dimension name
xr.DataArray(data = demand, dims = "draw")
<xarray.DataArray (draw: 100)>
array([42, 47, 43, 41, 38, 40, 41, 45, 29, 29, 49, 44, 41, 46, 40, 39, 37,
       41, 37, 37, 38, 28, 28, 39, 48, 38, 47, 41, 43, 37, 40, 41, 35, 44,
       41, 35, 35, 39, 35, 47, 45, 35, 30, 48, 44, 38, 35, 40, 44, 41, 38,
       47, 42, 35, 36, 36, 44, 37, 42, 41, 40, 36, 36, 43, 43, 48, 27, 44,
       38, 37, 32, 48, 49, 29, 37, 44, 46, 42, 33, 36, 49, 33, 46, 35, 36,
       34, 45, 36, 41, 44, 35, 36, 38, 31, 50, 47, 43, 47, 31, 37],
      dtype=int64)
Dimensions without coordinates: draw

where the output now replaces dim_0: 100 with draw: 100 telling us there are 100 draws of demand. It will come in handy for us to be able to refer to specific draws, say draw #3, so we label these draws. Every label value used for indexing an array is called a coordinate. We add the coordinates to our array of demand using a dict-like object supplied to the coords argument:

The one-argument numpy.arange(stop) function returns an array of integers starting at 0 and ending at stop - 1. Mathematically, it returns an array of integers on the half-open interval [0, stop). Thus, numpy.arange(100)+1 gives us a set of 100 integers starting at 1 and ending at 100 (inclusive).
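A quick sketch verifying this behavior (the draws variable name is just illustrative):

```python
import numpy as np

np.arange(5)               # integers on the half-open interval [0, 5): array([0, 1, 2, 3, 4])
draws = np.arange(100) + 1  # shift to start at 1: the labels 1, 2, ..., 100
```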

## explicit labeling of coordinates - must use name now to create dataset later
demandDA = xr.DataArray(data = demand, coords = {"draw": np.arange(100)+1}, name = "demand")
demandDA
<xarray.DataArray 'demand' (draw: 100)>
array([42, 47, 43, 41, 38, 40, 41, 45, 29, 29, 49, 44, 41, 46, 40, 39, 37,
       41, 37, 37, 38, 28, 28, 39, 48, 38, 47, 41, 43, 37, 40, 41, 35, 44,
       41, 35, 35, 39, 35, 47, 45, 35, 30, 48, 44, 38, 35, 40, 44, 41, 38,
       47, 42, 35, 36, 36, 44, 37, 42, 41, 40, 36, 36, 43, 43, 48, 27, 44,
       38, 37, 32, 48, 49, 29, 37, 44, 46, 42, 33, 36, 49, 33, 46, 35, 36,
       34, 45, 36, 41, 44, 35, 36, 38, 31, 50, 47, 43, 47, 31, 37],
      dtype=int64)
Coordinates:
  * draw     (draw) int32 1 2 3 4 5 6 7 8 9 10 ... 92 93 94 95 96 97 98 99 100

Notice, we can drop the dims argument as the dimension name is supplied in the dictionary object passed to the coords argument.
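With labelled coordinates in place, we can look up a draw by its label rather than its position; a minimal sketch reproducing the array above and selecting draw #3 (the .sel() method used here is covered in detail later in this chapter):

```python
from numpy.random import default_rng
import numpy as np
import xarray as xr

rng = default_rng(seed=111)
demandDA = xr.DataArray(data=rng.binomial(n=200, p=0.2, size=100),
                        coords={"draw": np.arange(100) + 1},
                        name="demand")
demandDA.sel(draw=3)   # the single demand value labelled draw #3
```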

Figure 5.2: The newsvendor has two outcomes of interest, profit and lost sales.

Recall the graphical model of Figure 5.1, reproduced here in the margin as Figure 5.2 for convenience. We have started a mapping of graphical model elements to code equivalents in the xarray package. We have an array of demand values, \(d_i\), indexed by \(i\). The values \(d_i\) are the values of the DataArray and the index \(i\) provides the coordinates; \(i\) is the label we use to refer to the draw number of each demand value. Analogously, order quantity, \(q\), can be thought of as the label we use to refer to each order quantity - somewhat confusingly, though, the array value for order quantity and the label we use to refer to that order quantity are the same. In general, plate indices will map to coordinates and decision/random variable nodes will map to data arrays. Following this rule, let’s make a new data array for order quantity where index and array value take identical values - the index represents the order-quantity plate index in Figure 5.1 and the data is mapped from the rectangular decision node.

## creating a DataArray of order quantities - must use name now to create dataset later
orderDA = xr.DataArray(data = np.arange(25,51), 
                       coords = {"orderQtyIndex": np.arange(25,51)},
                       name = "orderQty")
orderDA
<xarray.DataArray 'orderQty' (orderQtyIndex: 26)>
array([25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
       42, 43, 44, 45, 46, 47, 48, 49, 50])
Coordinates:
  * orderQtyIndex  (orderQtyIndex) int32 25 26 27 28 29 30 ... 45 46 47 48 49 50

Now for the magic! We will create the second data structure of the xarray package, the Dataset. A Dataset is a container for related DataArray objects. As Figure 5.2 shows, we need both demand and order quantity to calculate our outcomes of interest, namely profit and lost sales. We will soon see that placing demand and order quantity in the same Dataset allows us to easily derive the DataArray objects for profit and lost sales; both of which are indexed by \(i\) and \(q\) as they are contained in both the Sim Num and Order Quantity plates.

Creating a Dataset from existing DataArray objects is accomplished by passing a list of data array(s) to the merge function which will return a dataset:

# create dataset by combining data arrays
newsvDS = xr.merge([demandDA,orderDA])
newsvDS
<xarray.Dataset>
Dimensions:        (draw: 100, orderQtyIndex: 26)
Coordinates:
  * draw           (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex  (orderQtyIndex) int32 25 26 27 28 29 30 ... 45 46 47 48 49 50
Data variables:
    demand         (draw) int64 42 47 43 41 38 40 41 45 ... 31 50 47 43 47 31 37
    orderQty       (orderQtyIndex) int32 25 26 27 28 29 30 ... 45 46 47 48 49 50

The cardinality of a set is the number of elements in the set. Symbolically, the absolute value notation, \(|\cdot|\) is used to express cardinality of a set. The Cartesian product of two sets \(A\) and \(B\), denoted \(A \times B\), is the set of all combinations (ordered pairs) with one element from \(A\) and one-element from \(B\). The cardinality of the Cartesian product is the product of the cardinalities of the input sets, i.e. \(|A \times B| = |A| \times |B|\). See this wikipedia entry on Cartesian products for more info.

Our Dataset container now has the dimensions implied by the plate indices in Figure 5.1, namely draw: 100 and orderQtyIndex: 26. In math terms we have two sets: one is a set of demand draws with cardinality \(|D|=100\); the other a set of order quantities with cardinality \(|Q|=26\). Thus, the set of all ordered pairs \((d_i,q)\) to be used for calculation of profit and lost sales will have cardinality \(|D| \times |Q| = 2600\), and our simulation of potential \(\pi(d_i,q)\) and \(\ell(d_i,q)\) values will each have 2,600 elements.
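As a quick sanity check on that count, broadcasting two arrays with distinct dimension names produces one value per ordered pair; a minimal sketch with an illustrative constant demand of 40:

```python
import numpy as np
import xarray as xr

d = xr.DataArray(np.full(100, 40), coords={"draw": np.arange(100) + 1})
q = xr.DataArray(np.arange(25, 51), coords={"orderQtyIndex": np.arange(25, 51)})
profit = 3 * np.minimum(d, q) - 1 * q   # broadcasts across both dimensions
profit.size   # 100 * 26 = 2600
```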

5.2 Seven mental models of dataset manipulation

To calculate the 2,600 elements, we will use the first of seven categories of important data manipulations, known as the seven mental models of dataset manipulation (inspired by Wickham et al. (2022)). We quickly list the seven mental models here before continuing on towards our objective of calculating representative samples of profit and lost sales for each potential order quantity.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation.

“Seven mental models of dataset manipulation” is not an industry-standard term. It’s a mnemonic device for us to recognize that seven functions will get you most of what you need for manipulating datasets. I take inspiration from Hadley Wickham’s verbs of data manipulation, which have been wildly successful in simplifying the ease with which one’s thinking is translated into data-manipulating code.

The seven most important ways we might want to manipulate a data array or dataset are to:

  1. Assign: .assign() or .assign_coords(): Add data variables with broadcasting and array math. (Can also use dict-like methods.)
  2. Subset: .sel() or .where() subset a data array or dataset based on coordinates or data values, respectively.
  3. Drop: .drop_vars() or .drop_dims(): Remove an explicit list of data variables or remove all data variables indexed by a particular dimension.
  4. Sort: .sortby() sorts or arranges a data array or dataset based on data values or coordinate values.
  5. Aggregate: See the xarray documentation for a list of aggregation functions. These functions will collapse all the data of a given dimension; for example one can collapse a time dimension using the mean() aggregation method to get the average value for all of time.
  6. Split-Apply-Combine: .groupby() and DatasetGroupBy.foo() are usually used in combination to 1) split the dataset into groups based on levels of a variable, 2) apply a function (e.g. foo()) to each group’s dataset individually, and then 3) combine the modified datasets. See the xarray documentation for more details.
  7. Merge(join): Getting information from two datasets to intelligently combine.

5.2.1 Assign: Adding Data Arrays

Recapping what has been done from a graphical model perspective, we created DataArray objects for all of the nodes that lack parent nodes in the graphical model, like \(d\) and \(q\) in Figure 5.2, and then we used merge() to create a dataset containing all the parentless data arrays, as they are the basis for our coordinates and all additional computations. Now, introduced here, we use the .assign() method to create new data variables for all the other (children) nodes of the graphical model.

Notice the \(\min(d_i,q)\) term used in calculating profit. This term actually represents the number of newspapers sold for any ordered pair of \((d_i,q)\). If demand exceeds the order amount, then the newsvendor can still only sell \(q\) newspapers. And if the newsvendor over-orders, \(q>d\), then the newsvendor can only sell \(d\) newspapers.

(  ## open parenthesis to start readable code
    newsvDS
    .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
) ## close parenthesis finishes the "method chaining"
<xarray.Dataset>
Dimensions:         (draw: 100, orderQtyIndex: 26)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex   (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
Data variables:
    demand          (draw) int64 42 47 43 41 38 40 41 ... 31 50 47 43 47 31 37
    orderQty        (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
    soldNewspapers  (draw, orderQtyIndex) int64 25 26 27 28 29 ... 37 37 37 37
(  ## open parenthesis to start readable code
    newsvDS
    .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
    .assign(revenue = 3 * newsvDS.soldNewspapers)
) ## close parenthesis finishes the "method chaining"

The above code will yield an error:

AttributeError: 'Dataset' object has no attribute 'soldNewspapers'

The last assignment apparently does not have visibility into the newly created data for soldNewspapers. To pass the current state of the dataset to the .assign() method, we use a lambda function. The lambda function has syntax lambda arguments : expression, where lambda is a keyword telling Python to expect an argument (or arguments), followed by a colon (:), and then an expression for what the function will return; in our case, the argument will provide a way of referencing the current state of the dataset in the method chain. We will call it DS to signal to our brain that the .assign() method is receiving a dataset. Here is updated code that works:

(  ## open parenthesis to start readable code
    newsvDS
    .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
    .assign(revenue = lambda DS: 3 * DS.soldNewspapers)
) ## use lambda function to get current state of dataset in chain
<xarray.Dataset>
Dimensions:         (draw: 100, orderQtyIndex: 26)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex   (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
Data variables:
    demand          (draw) int64 42 47 43 41 38 40 41 ... 31 50 47 43 47 31 37
    orderQty        (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
    soldNewspapers  (draw, orderQtyIndex) int64 25 26 27 28 29 ... 37 37 37 37
    revenue         (draw, orderQtyIndex) int64 75 78 81 84 ... 111 111 111 111

Yes, I think the need to use lambda functions creates unnecessary cognitive friction. I wish it was easier. That being said, method chaining creates such readable and debuggable code, it is worth incurring a little overhead now to learn about lambda functions.

Now, along with some intermediate variables to enhance understanding of the math, we create the two primary outcomes of interest, profit and lost sales. We also convert all assignments into lambda functions so that we may reuse this method chain with another dataset. Lastly, a two-line combo converts our dataset to a dataframe so that we may take an error-checking peek at a few observations of our outcomes:

newsvDS = (newsvDS
            .assign(soldNewspapers = lambda DS: np.minimum(DS.demand, DS.orderQty))
            .assign(revenue = lambda DS: 3 * DS.soldNewspapers)
            .assign(expense = lambda DS: 1 * DS.orderQty)
            .assign(profit = lambda DS: DS.revenue - DS.expense)
            .assign(lostSales = lambda DS: np.maximum(0, DS.demand - DS.orderQty))
)

(newsvDS
 .to_dataframe()  #dataframe for printing
 .sample(5, random_state = 111))  ## show five rows of DF
                    demand  orderQty  ...  profit  lostSales
draw orderQtyIndex                    ...                   
39   36                 35        36  ...      69          0
22   46                 28        46  ...      38          0
17   47                 37        47  ...      64          0
60   35                 41        35  ...      70          6
21   48                 38        48  ...      66          0

[5 rows x 7 columns]

Note, one can also add columns directly using dict-like indexing when chains of operations are not required. The following code would work similarly to what we did earlier:

newsvDS["soldNewspapers"] = np.minimum(newsvDS.demand,newsvDS.orderQty)
newsvDS["expense"] = newsvDS.orderQty
newsvDS["revenue"] = 3 * newsvDS.soldNewspapers
newsvDS["profit"] = newsvDS.revenue - newsvDS.expense
newsvDS["lostSales"] = np.maximum(0, newsvDS.demand - newsvDS.orderQty)
newsvDS

5.2.2 Select a subset of the data array or dataset

Recall the syntax of help documentation is often packagename.Class.method, where the class name is typically capitalized. So, when referring to a method like sel() that is available for a Dataset object in the xarray package, documentation will refer to the method as xarray.Dataset.sel. Sometimes for brevity, I will drop the package name and use Dataset.sel. Also recall that to use a method, add parentheses after the method name, i.e. sel().

To reduce a dataset or data array by selecting a subset of coordinates to keep, use the associated methods: Dataset.sel() or DataArray.sel().

# select a particular value for a dimension
newsvDS.sel(orderQtyIndex = 36) # returns 1-d dataset
<xarray.Dataset>
Dimensions:         (draw: 100)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
    orderQtyIndex   int32 36
Data variables:
    demand          (draw) int64 42 47 43 41 38 40 41 ... 31 50 47 43 47 31 37
    orderQty        int32 36
    soldNewspapers  (draw) int64 36 36 36 36 36 36 36 ... 31 36 36 36 36 31 36
    revenue         (draw) int64 108 108 108 108 108 108 ... 108 108 108 93 108
    expense         int32 36
    profit          (draw) int64 72 72 72 72 72 72 72 ... 57 72 72 72 72 57 72
    lostSales       (draw) int64 6 11 7 5 2 4 5 9 0 0 ... 0 0 2 0 14 11 7 11 0 1

xarray follows the pandas convention for selecting a range of coordinate values to keep using the slice function.

# slicing returns all values inside the range (inclusive)
# as long as the index labels are monotonic increasing
newsvDS.sel(orderQtyIndex = slice(36,38))
<xarray.Dataset>
Dimensions:         (draw: 100, orderQtyIndex: 3)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex   (orderQtyIndex) int32 36 37 38
Data variables:
    demand          (draw) int64 42 47 43 41 38 40 41 ... 31 50 47 43 47 31 37
    orderQty        (orderQtyIndex) int32 36 37 38
    soldNewspapers  (draw, orderQtyIndex) int64 36 37 38 36 37 ... 31 36 37 37
    revenue         (draw, orderQtyIndex) int64 108 111 114 108 ... 108 111 111
    expense         (orderQtyIndex) int32 36 37 38
    profit          (draw, orderQtyIndex) int64 72 74 76 72 74 ... 55 72 74 73
    lostSales       (draw, orderQtyIndex) int64 6 5 4 11 10 9 7 ... 0 0 0 1 0 0

Slicing returns a smaller dataset or data array based on coordinates, but often we want a smaller dataset based on data values. In these cases, we apply the .where() method, whose argument is a logical condition specifying which data to keep:

# need to explicitly use Dataset.DataArray syntax for
# filtering out rows that do not meet the condition
newsvDS.where(newsvDS.lostSales > 0)
<xarray.Dataset>
Dimensions:         (draw: 100, orderQtyIndex: 26)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex   (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
Data variables:
    demand          (draw, orderQtyIndex) float64 42.0 42.0 42.0 ... nan nan nan
    orderQty        (orderQtyIndex, draw) float64 25.0 25.0 25.0 ... nan nan nan
    soldNewspapers  (draw, orderQtyIndex) float64 25.0 26.0 27.0 ... nan nan nan
    revenue         (draw, orderQtyIndex) float64 75.0 78.0 81.0 ... nan nan nan
    expense         (orderQtyIndex, draw) float64 25.0 25.0 25.0 ... nan nan nan
    profit          (draw, orderQtyIndex) float64 50.0 52.0 54.0 ... nan nan nan
    lostSales       (draw, orderQtyIndex) float64 17.0 16.0 15.0 ... nan nan nan

Oftentimes, the lambda syntax for anonymous functions is used to pass in the current dataset without repeating its name:

(newsvDS.where(lambda x: x.lostSales > 0, drop = True)
 .to_dataframe()  #convert to pandas dataframe for printing
 .dropna() # pandas method to remove NaN rows
 .sample(5, random_state = 111))
                    demand  orderQty  ...  profit  lostSales
draw orderQtyIndex                    ...                   
30   34               37.0      34.0  ...    68.0        3.0
98   27               47.0      27.0  ...    54.0       20.0
100  26               37.0      26.0  ...    52.0       11.0
83   30               46.0      30.0  ...    60.0       16.0
26   34               38.0      34.0  ...    68.0        4.0

[5 rows x 7 columns]

You should experiment with omitting the pandas.DataFrame.dropna method from the above. Because filtering on data variables does not change the coordinate system, many coordinate combinations that are part of the xarray dataset will have missing values because the values that were there did not survive the filtering process (e.g. lostSales > 0).
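The contrast between plain .where() and .where(..., drop = True) is easiest to see on a tiny one-dimensional array; a minimal sketch with illustrative values:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1, 2, 3, 4], coords={"x": [10, 20, 30, 40]})
kept = da.where(da > 2)                # same 4 coordinates; 1 and 2 become NaN
trimmed = da.where(da > 2, drop=True)  # coordinates whose values are all NaN get dropped
```

In higher dimensions, drop = True can only remove a coordinate label when every value for that label fails the condition, which is why the dropna step was still needed above.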

For more information on selecting subsets of datasets or arrays, or dropping a data array from an existing dataset, this xarray tutorial on selecting and indexing data is useful https://docs.xarray.dev/en/stable/user-guide/indexing.html.

5.2.3 Drop Dimensions

The drop_dims() method returns a new object by dropping a full dimension from a dataset along with any variables whose coordinates rely on that dimension.

newsvDS.drop_dims("orderQtyIndex")
<xarray.Dataset>
Dimensions:  (draw: 100)
Coordinates:
  * draw     (draw) int32 1 2 3 4 5 6 7 8 9 10 ... 92 93 94 95 96 97 98 99 100
Data variables:
    demand   (draw) int64 42 47 43 41 38 40 41 45 29 ... 38 31 50 47 43 47 31 37

Above, the order quantity dimension is dropped along with all the data variables whose value depended on order quantity: orderQty, soldNewspapers, revenue, expense, profit, and lostSales.

If you want to just drop some of the data variables, you use drop_vars():

newsvDS.drop_vars(["revenue","expense"])
<xarray.Dataset>
Dimensions:         (draw: 100, orderQtyIndex: 26)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex   (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
Data variables:
    demand          (draw) int64 42 47 43 41 38 40 41 ... 31 50 47 43 47 31 37
    orderQty        (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
    soldNewspapers  (draw, orderQtyIndex) int64 25 26 27 28 29 ... 37 37 37 37
    profit          (draw, orderQtyIndex) int64 50 52 54 56 58 ... 64 63 62 61
    lostSales       (draw, orderQtyIndex) int64 17 16 15 14 13 12 ... 0 0 0 0 0

5.2.4 Sort a data array or dataset based on data values or coordinate values.

For us, sorting is best done outside of xarray. We will typically want dataframe-like reports generated out of xarray as a last step in data manipulation. We will rely on pandas.DataFrame.sort_values() to help us for this mental model.

(newsvDS
 .to_dataframe()
 .sort_values("profit"))
                    demand  orderQty  ...  profit  lostSales
draw orderQtyIndex                    ...                   
67   50                 27        50  ...      31          0
     49                 27        49  ...      32          0
     48                 27        48  ...      33          0
23   50                 28        50  ...      34          0
67   47                 27        47  ...      34          0
...                    ...       ...  ...     ...        ...
73   49                 49        49  ...      98          0
81   49                 49        49  ...      98          0
95   49                 50        49  ...      98          1
11   49                 49        49  ...      98          0
95   50                 50        50  ...     100          0

[2600 rows x 7 columns]

Using .sort_values("profit", ascending = False) would reverse the sort order so maximum profit is first.

5.2.5 Aggregation

See the xarray documentation at https://docs.xarray.dev/en/stable/api.html#id6 for a more complete list of aggregation functions.

This is the mental model we will use to report summary statistics (mean, median, quantile); we will use it throughout the remainder of this textbook.

We have a two-step process:

  1. Aggregate the information in a data array.
  2. Assign the output of the aggregation to a new data array in a pre-existing dataset.

By way of example, let’s get the expected profit associated with each order quantity for our newsvendor in Example 5.1. Currently, our dataset, newsvDS, has 100 values (one for each draw) for each of the 26 order quantities. What we want is just one value for each of the 26 order quantities; so, we seek to collapse all 100 draws for an order quantity into one number representing the mean profit. Starting this process, we aggregate the profit array along the dimension(s) we want to collapse; in this case, we no longer want the draw dimension:

## collapse the 100 draws into 1 summary statistic
newsvDS.profit.mean(dim = "draw")
<xarray.DataArray 'profit' (orderQtyIndex: 26)>
array([50.  , 52.  , 54.  , 55.97, 57.88, 59.7 , 61.49, 63.22, 64.92,
       66.56, 68.17, 69.51, 70.61, 71.47, 72.12, 72.68, 73.09, 73.2 ,
       73.19, 73.03, 72.63, 72.14, 71.56, 70.8 , 69.92, 68.95])
Coordinates:
  * orderQtyIndex  (orderQtyIndex) int32 25 26 27 28 29 30 ... 45 46 47 48 49 50

Notice, this returns a DataArray object. We will then keep our data and summary statistics together in one dataset by adding the array back to the original dataset using assign(). Here the two-step workflow is demonstrated to return expected profit and expected lost sales for each order quantity:

## create mean summary stats
(
    newsvDS
    .assign(expProfit = newsvDS.profit.mean(dim="draw"))
    .assign(expLossSales = newsvDS.lostSales.mean(dim="draw"))
)
<xarray.Dataset>
Dimensions:         (draw: 100, orderQtyIndex: 26)
Coordinates:
  * draw            (draw) int32 1 2 3 4 5 6 7 8 9 ... 93 94 95 96 97 98 99 100
  * orderQtyIndex   (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
Data variables:
    demand          (draw) int64 42 47 43 41 38 40 41 ... 31 50 47 43 47 31 37
    orderQty        (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
    soldNewspapers  (draw, orderQtyIndex) int64 25 26 27 28 29 ... 37 37 37 37
    revenue         (draw, orderQtyIndex) int64 75 78 81 84 ... 111 111 111 111
    expense         (orderQtyIndex) int32 25 26 27 28 29 30 ... 46 47 48 49 50
    profit          (draw, orderQtyIndex) int64 50 52 54 56 58 ... 64 63 62 61
    lostSales       (draw, orderQtyIndex) int64 17 16 15 14 13 12 ... 0 0 0 0 0
    expProfit       (orderQtyIndex) float64 50.0 52.0 54.0 ... 70.8 69.92 68.95
    expLossSales    (orderQtyIndex) float64 14.65 13.65 12.65 ... 0.05 0.01 0.0

Under “Data variables”, notice expProfit and expLossSales are one-dimensional, with that one dimension being orderQtyIndex. Pretty cool how we can compactly store all this related information in one dataset! Feel free to play around with other frequently used aggregation functions, including count, first, last, max, mean, median, min, quantile, and sum.
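For instance, quantile works just like mean in the two-step workflow; a hedged sketch on a toy profit array (the normal distribution and coordinate values here are illustrative, not the newsvendor data):

```python
from numpy.random import default_rng
import numpy as np
import xarray as xr

rng = default_rng(seed=111)
profitDA = xr.DataArray(rng.normal(loc=70, scale=10, size=(100, 3)),
                        coords={"draw": np.arange(100) + 1,
                                "orderQtyIndex": [40, 41, 42]},
                        name="profit")
## collapse the draw dimension into the 5th and 95th percentiles
profitRange = profitDA.quantile([0.05, 0.95], dim="draw")
```

The result gains a new quantile dimension in place of the collapsed draw dimension, so each order quantity gets a low/high pair.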

5.2.6 Split-Apply-Combine

.groupby() and DatasetGroupBy.foo() are usually used in combination to 1) split the dataset into groups based on levels of a variable, 2) apply a function (e.g. foo()) to each group’s dataset individually, and then 3) combine the modified datasets. See the xarray documentation for more details. Here is a small sample of code:

## find average profit by orderQty
## see documentation here: https://docs.xarray.dev/en/stable/generated/xarray.core.groupby.DatasetGroupBy.mean.html
(
    newsvDS
    .get("profit")
    .groupby("orderQtyIndex")
    .mean(...)
).to_dataframe()
               profit
orderQtyIndex        
25              50.00
26              52.00
27              54.00
28              55.97
29              57.88
30              59.70
31              61.49
32              63.22
33              64.92
34              66.56
35              68.17
36              69.51
37              70.61
38              71.47
39              72.12
40              72.68
41              73.09
42              73.20
43              73.19
44              73.03
45              72.63
46              72.14
47              71.56
48              70.80
49              69.92
50              68.95

5.2.7 Merge (join)

Here, we can bring in information from another dataset. More info for this section is forthcoming.
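Until then, here is a minimal sketch of merging two datasets that share a coordinate; the price and cost variables are hypothetical, not part of Example 5.1:

```python
import xarray as xr

priceDS = xr.Dataset({"price": ("store", [3.0, 2.5])}, coords={"store": ["A", "B"]})
costDS  = xr.Dataset({"cost":  ("store", [1.0, 1.2])}, coords={"store": ["A", "B"]})
combined = xr.merge([priceDS, costDS])   # aligns on the shared "store" coordinate
combined["margin"] = combined.price - combined.cost
```

Because merge aligns on coordinate labels, the two datasets do not need their stores listed in the same order for the arithmetic to line up correctly.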

5.3 Getting Help

The best place to learn about xarray is its help documentation available at https://docs.xarray.dev/en/stable/index.html. Additionally, YouTube has some tutorials available if you search there.

5.4 Questions to Learn From

See CANVAS.