3 Data visualization

3.2 First steps

3.2.1 Exercise

  1. Run ggplot(data = mpg). What do you see? An empty plot. ggplot() draws only the background panel because no geom layer has been added yet.

3.2.2 Exercise

  1. How many rows and columns are in mtcars?
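One way to check, as a minimal sketch in base R (mtcars ships with the datasets package, so nothing needs to be installed; the variable names are illustrative):

```r
# dim() returns c(rows, columns); nrow() and ncol() return each one separately
dims <- dim(mtcars)
n_rows <- nrow(mtcars)  # 32
n_cols <- ncol(mtcars)  # 11
dims
```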

3.2.3 Exercise

  1. What does the drv variable describe? The type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd).

3.2.4 Exercise

  1. Make a scatter-plot of hwy versus cyl.
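A possible answer, reading "hwy versus cyl" as hwy on the vertical axis and cyl on the horizontal axis (the object name p_hwy_cyl is only illustrative):

```r
library(ggplot2)

# hwy on the y axis, cyl on the x axis
p_hwy_cyl <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy))
p_hwy_cyl
```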

3.2.5 Exercise

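  1. What happens if you make a scatter-plot of class versus drv? Why is the plot not useful? Both variables are categorical, so the points fall on a small grid of class/drv combinations and many observations are drawn on top of each other. The plot only shows which combinations occur, not how many cars each one contains. ("A versus B" puts A on the vertical axis and B on the horizontal axis.)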

3.3 Aesthetic mappings

3.3.1 Exercise

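  1. What’s gone wrong with this code? Why are the points not blue? Because color = "blue" is inside aes(). ggplot2 then treats "blue" as a data value to map to a color scale instead of as the color to draw with. To make the points blue, set color outside aes(), as an argument to the geom.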
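The difference can be sketched as follows, assuming the exercise plots displ against hwy as in the chapter (the object names are illustrative):

```r
library(ggplot2)

# Mapped: "blue" is treated as a one-level variable and gets the default scale color
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Set: color is a fixed property of the layer, so the points really are blue
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_set
```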

3.3.5 Exercise

  1. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point.) It modifies the width of the border. It only works on shapes with borders (like 21).
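A small sketch of the effect, using shape 21 (a filled circle with a border); the size, stroke and color values are arbitrary choices for illustration:

```r
library(ggplot2)

# stroke sets the border width; colour is the border, fill is the interior
p_stroke <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 3, stroke = 2)
p_stroke
```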

3.3.6 Exercise

  1. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? ggplot2 evaluates the expression for every row and maps the result. With a relational operator the result is logical, so the points are colored by TRUE and FALSE.
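For example, assuming the same displ against hwy scatterplot used elsewhere in the chapter:

```r
library(ggplot2)

# displ < 5 is evaluated per row; the legend shows the levels FALSE and TRUE
p_lgl <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = displ < 5))
p_lgl
```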

3.4 Facets

3.4.1 Exercise

  1. What happens if you facet a continuous variable? ggplot2 creates a separate panel for each distinct value of the variable, which usually means many panels with only a few points in each.

3.4.2 Exercise

  1. What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot? An empty cell means that the data contain no observation with that combination of drv and cyl. For example, there are no 4wd cars with 5 cylinders. A scatter-plot of drv against cyl shows the same thing: the positions with no point are the combinations that appear as empty panels.
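The missing combinations can also be checked without plotting. This is a minimal sketch using a base R cross-tabulation (the names combos and empty_cells are illustrative):

```r
library(ggplot2)  # only needed for the mpg data set

# Cross-tabulate drv by cyl; a zero count corresponds to an empty facet
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos
empty_cells <- sum(combos == 0)
```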

3.4.3 Exercise

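  1. What plots does the following code make? What does . do? They make faceted plots along a single dimension. In facet_grid(rows ~ cols), . is a placeholder that means "do not facet on this dimension": with . on the right-hand side the panels are stacked in rows, and with . on the left-hand side they are laid out in columns.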
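A sketch of both cases, assuming the exercise code facets the displ against hwy scatterplot by drv and by cyl:

```r
library(ggplot2)

p_base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# . on the right: one row of panels per level of drv
p_rows <- p_base + facet_grid(drv ~ .)

# . on the left: one column of panels per level of cyl
p_cols <- p_base + facet_grid(. ~ cyl)
p_rows
p_cols
```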

3.4.4 Exercise

  1. Take the first faceted plot in this section:

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger data set?

Advantages: faceting shows each level of the variable in its own panel. This reduces overlap between groups and makes the pattern within a single level easier to read.

Disadvantages: comparing values across panels is harder than comparing colors within one plot, where all points share the same axes and differences between groups are visible at a glance. With a larger data set, overplotting makes color harder to read, so faceting becomes more attractive. If the faceting variable has too many levels, however, the panels become too small for the available screen space.

3.4.5 Exercise

  1. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables? nrow and ncol set the number of rows and columns used to lay out the panels. Other layout options include dir, as.table and strip.position. facet_grid() does not have nrow and ncol because its layout is fixed by the data: the number of rows and columns equals the number of unique levels of the row and column faceting variables.

Control the number of rows and columns with nrow and ncol

Facet by multiple variables

Use the labeller option to control how labels are printed:

Free scales
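The four options listed above can be sketched as follows. The calls are modeled on the examples in ?facet_wrap, and the object names are illustrative:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Control the number of rows (or columns) of panels
p_nrow <- p + facet_wrap(vars(class), nrow = 4)

# Facet by multiple variables: one panel per observed cyl/drv combination
p_multi <- p + facet_wrap(vars(cyl, drv))

# Print both the variable name and its value in each strip label
p_label <- p + facet_wrap(vars(cyl, drv), labeller = "label_both")

# Let every panel have its own x and y scales
p_free <- p + facet_wrap(vars(class), scales = "free")
p_nrow
```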

3.4.6 Exercise

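  1. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why? Plots are usually wider than they are tall, so there is more room for many panels side by side, and it is visually easier to compare across columns.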

3.5 Geometric objects

You can map the linetype aesthetic to a variable, so that each level of the variable is drawn with a different line type.
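As an illustration, assuming a smoothed line of hwy against displ split by drv (the choice of variables is an assumption, not taken from these notes):

```r
library(ggplot2)

# One smoothed line per level of drv, each drawn with its own line type
p_lt <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
p_lt
```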

3.5.2 Exercise

  1. Run this code in your head and predict what the output will look like. Then run the code in R and check your predictions:

3.5.3 Exercise

  1. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter? It removes the legend for that layer. If you remove the argument, the legend is drawn again. It was used earlier because three plots were shown side by side: a legend on only one of them would have made the plots different sizes, and three legends would have looked messy.

3.5.4 Exercise

  1. What does the se argument for geom_smooth() do? It controls whether the confidence interval around the smoothed line is displayed. It is TRUE by default.
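A quick way to see the effect, assuming the displ against hwy data used throughout the chapter:

```r
library(ggplot2)

# Default: the smoothed line is drawn with a shaded confidence band
p_se <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(se = TRUE)

# se = FALSE: only the smoothed line is drawn
p_no_se <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(se = FALSE)
p_se
p_no_se
```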

3.6 Statistical transformation

3.6.1 Exercise

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function? The default geom is geom_pointrange().
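A possible rewrite, assuming the "previous plot" is the chapter's summary of depth by cut in the diamonds data (that plot is not reproduced in these notes). The argument names fun, fun.min and fun.max assume ggplot2 3.3.0 or later; older versions used fun.y, fun.ymin and fun.ymax:

```r
library(ggplot2)

# Stat-based version: stat_summary() draws with geom_pointrange() by default
p_stat <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min, fun.max = max, fun = median
  )

# Geom-based version: same summary, requested through stat = "summary"
p_geom <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min, fun.max = max, fun = median
  )
p_geom
```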

3.6.2 Exercise

  1. What does geom_col() do? How is it different to geom_bar()? geom_col() uses stat_identity() by default, so the bar heights are taken directly from the values mapped to y. Conversely, geom_bar() uses stat_count(), which transforms the data and plots the count (or the proportion, if requested) of observations at each value of x.

  2. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common? See the ggplot2 documentation.
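The two geoms produce the same bars when geom_col() is given pre-computed counts. A minimal sketch, where the counts data frame is a helper built only for this illustration:

```r
library(ggplot2)

# geom_bar() counts the rows in each class itself (stat_count)
p_bar <- ggplot(mpg) + geom_bar(aes(x = class))

# geom_col() plots the heights it is given (stat_identity), so count first
counts <- as.data.frame(table(class = mpg$class))  # columns: class, Freq
p_col <- ggplot(counts) + geom_col(aes(x = class, y = Freq))
p_col
```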

  3. What variables does stat_smooth() compute? What parameters control its behavior?

  • y: predicted value

  • ymin: lower pointwise confidence interval around the mean

  • ymax: upper pointwise confidence interval around the mean

  • se: standard error
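The computed variables can be inspected with layer_data(). In this sketch, method, formula and level are some of the stat_smooth() arguments that control its behavior (together with se, span and n); the specific values are only examples:

```r
library(ggplot2)

# Linear fit with a 95% confidence band
p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  stat_smooth(method = "lm", formula = y ~ x, level = 0.95)

# One row per evaluation point, with the computed y, ymin, ymax and se columns
computed <- layer_data(p_smooth)
head(computed[, c("x", "y", "ymin", "ymax", "se")])
```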

3.6.3 Exercise

  1. In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs? Proportions are calculated within each group, and by default geom_bar() groups by the value of x. Each bar is therefore 100% of its own group. Setting group = 1 puts all observations in a single group, so each proportion is relative to the whole data set.
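The effect can be sketched with the diamonds data, assuming the chart in question shows the proportion of each cut. after_stat(prop) assumes ggplot2 3.3.0 or later; older code wrote ..prop.. instead:

```r
library(ggplot2)

# Without group = 1 every bar is its own group, so every proportion is 1
p_wrong <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop)))

# With group = 1 the proportions are relative to all rows and sum to 1
p_right <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1))
p_right
```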

3.7 Position adjustments

3.7.1 Exercise

  1. What is the problem with this plot? How could you improve it? The plotted values are rounded, so many observations share the same coordinates and are drawn on top of each other. The plot looks sparser than the data really are. We can add a small amount of random noise (jitter) and reduce alpha so that overlapping points become visible.

3.7.2 Exercise

  1. What parameters to geom_jitter() control the amount of jittering? width and height. width controls the amount of horizontal jitter and height the amount of vertical jitter. The noise is added in both positive and negative directions, so the total spread is twice the value given.
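A short sketch, assuming a cty against hwy scatterplot; the width, height and alpha values are arbitrary choices for illustration:

```r
library(ggplot2)

# Jitter horizontally by up to 0.3 units each way, leave y untouched,
# and make points semi-transparent so overlaps show up as darker areas
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.4)
p_jitter
```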

3.7.3 Exercise

  1. Compare and contrast geom_jitter() with geom_count(). geom_jitter() adds random noise to each point so that overlapping points are spread apart. geom_count() counts the number of observations at each location and maps the count to point area. In other words, it keeps the points at their true positions and shows overplotting through point size.
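For comparison, a geom_count() sketch on the same assumed cty against hwy data:

```r
library(ggplot2)

# One point per distinct (cty, hwy) location; point area reflects the count n
p_count <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count()
p_count
```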

3.7.4 Exercise

  1. What’s the default position adjustment for geom_boxplot? Create a visualization of the mpg data set that demonstrates it. position = "dodge2" is the default position, so boxplots that share an x value are placed side by side instead of overlapping.
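One way to demonstrate it, with drv on the x axis and class as the grouping variable (both choices are assumptions made for illustration):

```r
library(ggplot2)

# Several classes share each drv value; the default position dodges them
p_box <- ggplot(mpg, aes(x = drv, y = hwy, colour = class)) +
  geom_boxplot()
p_box

# The default can also be read from the layer object itself
default_position <- class(geom_boxplot()$position)[1]
```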

3.8 Coordinate systems

3.8.1 Exercise

  1. Turn a stacked bar chart into a pie chart using coord_polar().
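A possible answer, using a single stacked bar of drv counts as the starting point (the choice of variable is an assumption):

```r
library(ggplot2)

# A single stacked bar: one x position, segments filled by drv
p_stacked <- ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1)

# Mapping the y axis (counts) to the angle turns the bar into a pie
p_pie <- p_stacked + coord_polar(theta = "y")
p_pie
```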

3.8.4 Exercise

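  1. What does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do? The relationship is positive and looks roughly linear. coord_fixed() fixes the aspect ratio so that one unit on the x axis is the same length as one unit on the y axis, which keeps the reference line at 45 degrees. geom_abline() adds a reference line which, with its default arguments, has slope 1 and intercept 0, making it easy to see where highway mpg exceeds city mpg.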
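The plot in question is not reproduced in these notes; a sketch of what it presumably looks like, assuming cty on the x axis and hwy on the y axis:

```r
library(ggplot2)

# Points above the reference line are cars whose highway mpg exceeds city mpg
p_line <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +   # defaults: slope = 1, intercept = 0
  coord_fixed()     # one x unit is drawn as long as one y unit
p_line

# Share of cars with higher highway than city mpg
share_above <- mean(mpg$hwy > mpg$cty)
```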