In this tutorial you will practice extracting information on dataset structure, subsetting data using base R, running logic tests, and using base R functions to describe and summarize data.

Objectives

Acknowledgements

Much of this material is adapted from Dr. Emily Burchfield’s fantastic tutorial on Data Management.

Set-up

First, let’s install the gapminder package. Type install.packages("gapminder") in the console to download the package from the CRAN repository onto your computer. The package documentation can be found here and more information about the Gapminder project can be found at www.gapminder.org.

Then load the package and the gapminder data.

library(gapminder)
## Warning: package 'gapminder' was built under R version 4.1.2
data(gapminder)

Remember that library() loads the package into your current R session and data() pulls the pre-made gapminder dataset into your Global Environment.

The Data Structure

Let’s inspect the new gapminder dataset:

head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Our gapminder dataset includes the following variables:

  • country
  • continent
  • year
  • lifeExp or life expectancy
  • pop or population
  • gdpPercap or Gross Domestic Product (GDP) per capita

Before we start working with our data it is helpful to understand its data structure. What type of data object is this? It looks like a table, right? Let’s see what R says it is by using class().

class(gapminder)
## [1] "tbl_df"     "tbl"        "data.frame"

Notice that our R output says we have classes ‘tbl_df’, ‘tbl’ and ‘data.frame’. This is saying that the data object is a tibble (tbl_df), a table-like object that is also a data frame. If you check your environment window you’ll notice that under the “Type” column you have tbl_df. This just means that we have multiple variables (columns), which can have different data types, and multiple records (rows) for each variable.

OK, but what if I’d like to know what class each variable is? (For example: are countries listed as character strings or as a factor?)

Using Base R functionality we can use a $ to access specific named columns in a dataframe or table. So we can access the country column by typing `gapminder$country`. Try checking the class of the country column.

class(gapminder$country)
## [1] "factor"

So the country column uses a Factor data class.

That means each country name is treated as a unique value or level (treated as nominal/categorical data that takes on a number of limited possible values). More information about using factors can be found here. Understanding factors can be important for some modeling where categorical or ordered variables are used. On a practical level this will become very useful as we start to group data using the dplyr package.

More fundamentally, it is good to be aware of factors because if your numerical variables are stored as factors, operations on those variables will not work as intended. Indeed, converting from factor to numeric is not entirely straightforward. Converting directly from a factor to a numeric data class will not report any errors, but it will not give the expected result: it returns the internal integer level codes rather than the original values. Instead you must convert from factor to character and then from character to numeric. One way to do this is by typing as.numeric(as.character(x)), where x is your factor.
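To make the factor pitfall concrete, here is a minimal sketch using a small made-up factor (the object name f and its values are purely illustrative, not part of gapminder):

```r
# A made-up factor whose labels look like numbers
f <- factor(c("10", "20", "20", "5"))

levels(f)                     # the levels are sorted as text: "10" "20" "5"

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)
codes

# Converting through character first recovers the original numbers
values <- as.numeric(as.character(f))
values
```

Notice that the direct conversion runs without any warning, which is exactly why this mistake is easy to miss.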

Our categorical variables, or variables that take on a number of limited possible values, are coded as "factor" variables.

Now if we want to check the class for each column we could repeat this for each variable by typing class(gapminder$variable_name) for each variable, or we could use the sapply() function to apply the class() function across all variables in the gapminder dataset. Remember, if you’re unfamiliar with any function, e.g. sapply(), you can ask R how it works using ?sapply().

?sapply()
## starting httpd help server ... done
sapply(gapminder, class)
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"

You should see that the year column is stored as integers, the lifeExp column is stored as numeric, and so on.

Another function for investigating data structure will actually tell us about the data type of the entire object and the data class of each column in one go. Try using the str() function.

str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Notice how each column is reported with a $? We’ll use this notation to access specific columns in our datasets.

As shown above, we can extract information for each variable in the dataset using $. For example, if we wanted to determine the range of years in the dataset we can simply type:

range(gapminder$year)
## [1] 1952 2007

So the data runs from 1952 to 2007. Is there data for every year over this period? We can check to see what unique values of year are in the dataset.

unique(gapminder$year)
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

Looks like this dataset only has observations every five years. Good to know!
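If you would rather confirm the spacing numerically than by eye, one option is to take the differences between consecutive unique values with diff(). This is only a sketch: the years vector below is a stand-in that mimics the output above, so swap in gapminder$year once the package is loaded.

```r
# Stand-in for gapminder$year: each year repeated, as it would be across countries
years <- rep(seq(1952, 2007, by = 5), times = 3)

yrs  <- sort(unique(years))  # the distinct years, in increasing order
gaps <- diff(yrs)            # difference between each consecutive pair of years

gaps
length(yrs)                  # how many distinct years there are
all(gaps == 5)               # TRUE when every gap is exactly five years
```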

Indexing

I’ll refer to base R a lot in this class. When I say this, I’m referring to the functions that come in the basic R language and that are not associated with a particular package that you have to load in using library(). The functions we’ve used so far like head() and unique() are base R functions. In the next tutorial, we’ll start working with functions from the dplyr package that do not come from base R, including select(), filter() and arrange().

In general in the tutorials, when I refer to packages, I will just list the package name in this font. For example, I may refer to ggplot2, dplyr, or gapminder.

I will refer to functions by including closed parentheses (), e.g. select(), filter(), etc. This is a reminder that functions almost always take arguments which you have to supply, e.g. mean(x) tells R to compute the mean of the variable x.

Ok, now that we’re up to speed on notation and terms, let’s start playing with our data.

Let’s start by seeing what would happen if we used str() on a vector instead of a dataframe. First store one variable from gapminder to a new object then check its structure.

le <- gapminder$lifeExp

str(le)
##  num [1:1704] 28.8 30.3 32 34 36.1 ...

You should see a single data class name (e.g. “num” for numeric) followed by [1:1704]. The square brackets [] denote indexing. Here, because we have a single vector and we are reporting the entire vector we see 1:1704. (Remember my warning that : shows up all over in R.) This is basically saying that we are looking at the 1st through the 1704th items in the vector (which is all items in the vector). Not sure about this? You can double-check the number of records in this vector using length().

Want to pull out a specific single value from the vector? We can do that using the [] notation as in:

le[52]
## [1] 65.634

This simply returns the 52nd value in the vector, which just happens to be 65.634.

We could also return a subset of the vector using a combination of our [] and : notation. For example we could pull the last 5 values in our vector.

le_sub <- le[1700:1704]

le_sub
## [1] 62.351 60.377 46.809 39.989 43.487
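Hard-coding 1700:1704 only works if you already know the vector has 1704 items. A more general sketch, using a short made-up vector v as a stand-in for le, builds the index from length() or uses tail():

```r
# Made-up stand-in for the le vector
v <- c(28.8, 30.3, 32.0, 34.0, 36.1, 38.4, 39.9)

n <- length(v)          # number of items in the vector

# The parentheses matter: n - 2:n would subtract the whole sequence 2:n from n
last3 <- v[(n - 2):n]
last3

tail(v, 3)              # base R shortcut that returns the same three values
```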

Now let’s try indexing on a data frame instead of a vector. With the vector within the [] we had one range, for a dataframe we will have two ranges: [x,y].

In this case we have indexing for our rows/records in the x slot and indexing for our columns/variables in the y slot. (Some functions will include an argument asking along which dimension you would like to calculate something. In this case x is the 1st dimension and y is the 2nd dimension.)

We could extract all variables (columns) for a single record (row).

df_sub <- gapminder[52, ]
df_sub
## # A tibble: 1 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Argentina Americas   1967    65.6 22934225     8053.

This code says we want to pull information from the 52nd row of the gapminder dataframe, and that R should return all information from all columns (because the y slot is blank).

You can also easily extract a single value or information for just a few columns and rows using [], :, and c().

df_sub2 <- gapminder[1700:1704, 1:3]
df_sub2
## # A tibble: 5 x 3
##   country  continent  year
##   <fct>    <fct>     <int>
## 1 Zimbabwe Africa     1987
## 2 Zimbabwe Africa     1992
## 3 Zimbabwe Africa     1997
## 4 Zimbabwe Africa     2002
## 5 Zimbabwe Africa     2007
df_sub3 <- gapminder[1700:1704, c(1,6)]
df_sub3
## # A tibble: 5 x 2
##   country  gdpPercap
##   <fct>        <dbl>
## 1 Zimbabwe      706.
## 2 Zimbabwe      693.
## 3 Zimbabwe      792.
## 4 Zimbabwe      672.
## 5 Zimbabwe      470.
df_value <- gapminder[52, 4]
df_value
## # A tibble: 1 x 1
##   lifeExp
##     <dbl>
## 1    65.6
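Note that the last result above is still a 1 x 1 table rather than a bare number. If you need the number itself (for example, to use it in a calculation), you can pull the column out first with $ or [[ ]] and then index into it. The sketch below uses a small made-up data frame named df; the same patterns should also work on a tibble such as gapminder.

```r
# Made-up data frame standing in for gapminder
df <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(60.1, 65.6, 70.2)
)

val1 <- df$lifeExp[2]        # column by name with $, then the 2nd element
val2 <- df[["lifeExp"]][2]   # same idea using [[ ]] with the column name
val3 <- df[[2, "lifeExp"]]   # [[row, column]] extracts a single cell

val1
is.numeric(val1)             # a plain number, not a table
```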

But it’s kinda hard to extract what we want using only index values. It’s easier if we can refer to actual values. Let’s say you want to extract observations for the country of Sri Lanka. We can use base R indexing and a simple logic statement to subset the full dataset:

sri_lanka <- gapminder[gapminder$country == "Sri Lanka", ]
head(sri_lanka)
## # A tibble: 6 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Sri Lanka Asia       1952    57.6  7982342     1084.
## 2 Sri Lanka Asia       1957    61.5  9128546     1073.
## 3 Sri Lanka Asia       1962    62.2 10421936     1074.
## 4 Sri Lanka Asia       1967    64.3 11737396     1136.
## 5 Sri Lanka Asia       1972    65.0 13016733     1213.
## 6 Sri Lanka Asia       1977    65.9 14116836     1349.

This tells R that you want to find ALL rows (first item) in which country == "Sri Lanka" (country is Sri Lanka), and to include ALL columns (second item) in the dataset.

Notice the double equal signs == that we use for equality testing. (For example, if we create a variable x that is set equal to 5 and want to confirm that x is, in fact, equal to 5, we could type x==5. The console should return TRUE. Try it!)
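It can help to look at what a logic test returns before using it inside []. In this small sketch (the country vector is made up for illustration), the test produces one TRUE or FALSE per item, and that logical vector is what does the subsetting:

```r
x <- 5
x == 5                        # a single TRUE

# Made-up vector of country names
country <- c("Sri Lanka", "Nepal", "Sri Lanka", "India")

is_sl <- country == "Sri Lanka"   # one TRUE/FALSE per element
is_sl

sum(is_sl)                    # TRUE counts as 1, so this counts the matches
which(is_sl)                  # positions of the matches
country[is_sl]                # keep only the elements where the test is TRUE
```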

Don’t forget the , after "Sri Lanka"! If you leave out the comma, R treats the logical test as a column selection rather than a row selection and returns an error, because a dataset with two dimensions needs both an x and a y location. If you only wanted the first column, then you could type gapminder[gapminder$country == "Sri Lanka", 1]. So we can subset rows or columns either by number/index or by value.

Back to the indexing. If we didn’t want ALL of the columns and only wanted the variable gdpPercap for Sri Lanka, we could do the following:

sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", "gdpPercap"]
head(sri_lanka_gdp)
## # A tibble: 6 x 1
##   gdpPercap
##       <dbl>
## 1     1084.
## 2     1073.
## 3     1074.
## 4     1136.
## 5     1213.
## 6     1349.

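Since we asked for a single variable, this returns just the gdpPercap column, listing all observations of Sri Lankan GDP per capita over the years included in the dataset. (Because gapminder is a tibble, the result is a one-column tibble; the same code on a plain data.frame would return a vector.) This isn’t very useful because we don’t know what years are associated with each observation. Let’s pull out yearly data too.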

sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", c("year", "gdpPercap")]
head(sri_lanka_gdp)
## # A tibble: 6 x 2
##    year gdpPercap
##   <int>     <dbl>
## 1  1952     1084.
## 2  1957     1073.
## 3  1962     1074.
## 4  1967     1136.
## 5  1972     1213.
## 6  1977     1349.

Here we create a list of column names we want to pull out of the dataset using c() which combines values into a vector or list much like we did last week. Note that you can use either a list of column names or a list of numerical indexes (e.g. Use c(1,3,5) to select the first, third, and fifth columns, or use c(1:5) to select columns one through five).
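Logic tests can also be combined with & (and) and | (or) inside the row slot. Here is a minimal sketch on a made-up data frame called toy (its column names and values are illustrative only); the same pattern should carry over to gapminder, e.g. a country test combined with a year test.

```r
# Made-up data frame standing in for gapminder
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1990, 2000, 2010, 1990, 2000, 2010),
  value   = c(1, 2, 3, 4, 5, 6)
)

# Rows where BOTH tests are TRUE, keeping two columns by name
a_recent <- toy[toy$country == "A" & toy$year >= 2000, c("year", "value")]
a_recent

# Rows where EITHER test is TRUE, keeping all columns
a_or_2010 <- toy[toy$country == "A" | toy$year == 2010, ]
a_or_2010
```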

Now let’s see if you can calculate the average, maximum, and minimum GDP per capita for Sri Lanka over the years in the dataset (1952 to 2007).

mean(sri_lanka_gdp$gdpPercap, na.rm=TRUE)
## [1] 1854.731
max(sri_lanka_gdp$gdpPercap, na.rm=TRUE)
## [1] 3970.095
min(sri_lanka_gdp$gdpPercap, na.rm=TRUE)
## [1] 1072.547
range(sri_lanka_gdp$gdpPercap, na.rm=TRUE) # this will give us both the min and max
## [1] 1072.547 3970.095

Ok, what if I want to know the average GDP per capita for all countries or for all years? aggregate() can help with that. If you’re confused check out the aggregate() help info by typing ?aggregate() into your console.

gdp_country <- aggregate(gapminder[ ,"gdpPercap"], by = gapminder["country"], FUN=mean, na.rm=TRUE) # na.rm=TRUE is passed on to mean()
head(gdp_country)
##       country  gdpPercap
## 1 Afghanistan   802.6746
## 2     Albania  3255.3666
## 3     Algeria  4426.0260
## 4      Angola  3607.1005
## 5   Argentina  8955.5538
## 6   Australia 19980.5956
gdp_time <- aggregate(gapminder[ ,"gdpPercap"], by = gapminder["year"], FUN=mean, na.rm=TRUE) # na.rm=TRUE is passed on to mean()
head(gdp_time)
##   year gdpPercap
## 1 1952  3725.276
## 2 1957  4299.408
## 3 1962  4725.812
## 4 1967  5483.653
## 5 1972  6770.083
## 6 1977  7313.166
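aggregate() can also be written with a formula, which some people find easier to read. The sketch below uses a made-up data frame toy with one missing value to show how the two styles deal with NA: with by =, extra arguments such as na.rm = TRUE are handed on to FUN, while the formula style drops incomplete rows by default.

```r
# Made-up data with one missing value
toy <- data.frame(
  group = c("x", "x", "y", "y", "y"),
  value = c(1, 3, 10, NA, 20)
)

# by = style: na.rm = TRUE is passed through to mean()
by_list <- aggregate(toy["value"], by = toy["group"], FUN = mean, na.rm = TRUE)
by_list

# Formula style: "value grouped by group"; rows with NA are omitted by default
by_formula <- aggregate(value ~ group, data = toy, FUN = mean)
by_formula
```

Both calls should give the same group means here, because the single NA is ignored in each case.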

Want to see how countries rank by average GDP per capita? Let’s try order() for our dataframe. (Use sort() for vectors.)

gdp_order <- gdp_country[order(gdp_country$gdpPercap), ]
head(gdp_order)
##       country gdpPercap
## 88    Myanmar  439.3333
## 18    Burundi  471.6630
## 43   Ethiopia  509.1152
## 42    Eritrea  541.0025
## 87 Mozambique  542.2783
## 78     Malawi  575.4472

This lists the countries by GDP in ascending order. What about descending order?

gdp_desc <- gdp_country[order(-gdp_country$gdpPercap), ] # or we can use gdp_country[order(gdp_country$gdpPercap, decreasing=TRUE), ]
head(gdp_desc)
##           country gdpPercap
## 72         Kuwait  65332.91
## 124   Switzerland  27074.33
## 96         Norway  26747.31
## 135 United States  26261.15
## 21         Canada  22410.75
## 91    Netherlands  21748.85
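The difference between sort() and order() is easiest to see on a small vector. In this sketch (the named vector gdp is made up), sort() returns the values themselves, while order() returns the positions you would use inside [] to put things in order, which is why order() is the one we use to rearrange the rows of a dataframe.

```r
# Made-up named vector
gdp <- c(b = 300, a = 100, c = 200)

sorted_up   <- sort(gdp)                     # values in ascending order
sorted_down <- sort(gdp, decreasing = TRUE)  # values in descending order
sorted_up
sorted_down

idx <- order(gdp)   # positions of the smallest, next smallest, ... values
idx
gdp[idx]            # indexing with those positions reproduces sort(gdp)
```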

What if we want to add a new variable to our dataset, say an indicator of whether or not a country is located in the continent Africa? We can add a named column to our dataframe using the $ and <- operators.

africa <- gapminder #assign the original gapminder dataset to a new object for further manipulation

africa$africa <- ifelse(gapminder$continent == "Africa", 1, 0) #create a new column named "africa"

head(africa)
## # A tibble: 6 x 7
##   country     continent  year lifeExp      pop gdpPercap africa
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.      0
## 2 Afghanistan Asia       1957    30.3  9240934      821.      0
## 3 Afghanistan Asia       1962    32.0 10267083      853.      0
## 4 Afghanistan Asia       1967    34.0 11537966      836.      0
## 5 Afghanistan Asia       1972    36.1 13079460      740.      0
## 6 Afghanistan Asia       1977    38.4 14880372      786.      0
head(africa[africa$africa == 1,]) #confirm that locations with continent == Africa have an africa value of 1
## # A tibble: 6 x 7
##   country continent  year lifeExp      pop gdpPercap africa
##   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
## 1 Algeria Africa     1952    43.1  9279525     2449.      1
## 2 Algeria Africa     1957    45.7 10270856     3014.      1
## 3 Algeria Africa     1962    48.3 11000948     2551.      1
## 4 Algeria Africa     1967    51.4 12760499     3247.      1
## 5 Algeria Africa     1972    54.5 14760787     4183.      1
## 6 Algeria Africa     1977    58.0 17152804     4910.      1
#And another way to do this
africa$africa2 <- 0

africa$africa2[africa$continent == "Africa"] <- 1
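A quick way to check that both approaches produce the same indicator is to compare the two columns directly. Here is a minimal sketch on a made-up data frame toy, so it runs without gapminder; the same checks could be applied to africa$africa and africa$africa2.

```r
# Made-up data frame with a continent column
toy <- data.frame(continent = c("Africa", "Asia", "Africa", "Europe"))

# Approach 1: ifelse()
toy$africa <- ifelse(toy$continent == "Africa", 1, 0)

# Approach 2: start at 0, then overwrite the matching rows
toy$africa2 <- 0
toy$africa2[toy$continent == "Africa"] <- 1

identical(toy$africa, toy$africa2)   # TRUE if the two columns match exactly
table(toy$continent, toy$africa)     # cross-tabulate continent against the indicator
```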

We’ve now covered some of the basics of indexing and subsetting. I recommend these tutorials on data subsetting and data manipulation to help solidify and extend your understanding. We will introduce data manipulation using dplyr next week, but will still make use of base R frequently.

Make sure you’re familiar with how to index and wrangle data in base R before we proceed!

Visualizing our data

For now we’ll take a look at using base R plot() and hist(). Next week we’ll start using ggplot to make figures.

Let’s start with the quickest and simplest overall visualization of the entire gapminder dataset.

plot(gapminder)

This produces a matrix of scatterplots where one variable is along the y-axis and another is along the x-axis.

Recall that some of our variables are factor class (or categorical data). Plotting them along x-y axes doesn’t really make much sense.

Let’s try plotting this again, but only for the numeric variables by subsetting within plot().

# manually specify --> totally fine for small datasets, not so great when you have many variables or for scaling up/automating

  plot(gapminder[,c("year","lifeExp","pop","gdpPercap")])
  plot(gapminder[,c(3,4,5,6)])

# one way to do this without manually specifying
  classes <- sapply(gapminder, class)
  classes 
##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"
  isnum <- classes[classes != "factor"] #subset of classes vector with only columns that are not factors
  isnum
##      year   lifeExp       pop gdpPercap 
## "integer" "numeric" "integer" "numeric"
  plot(gapminder[ , names(gapminder) %in% names(isnum)]) #plot for the subset where column names in gapminder match the column names in the isnum vector

# another way to subset and plot using the which function
  isnum_col <- which(classes != "factor") #return a vector of the indices where class is not factor
  
  isnum_col <- which(sapply(gapminder, class) != "factor") #same as above, but we use only 1 line instead of 2
  
  
  plot(gapminder[,isnum_col]) #plot for the subset of gapminder where column indices are in isnum_col
  
  plot(gapminder[ ,which(sapply(gapminder, class) != "factor")]) #same plot as above, with the column selection done in a single line
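Yet another option is to test each column with is.numeric() instead of comparing class names. This is a sketch on a made-up data frame toy rather than gapminder. One possible advantage is that is.numeric() always returns a single TRUE or FALSE per column, whereas class() can return more than one value for some column types (date-times, for example).

```r
# Made-up data frame with one factor and two numeric columns
toy <- data.frame(
  name  = factor(c("a", "b", "c")),
  year  = c(2000L, 2005L, 2010L),
  value = c(1.5, 2.5, 3.5)
)

num_cols <- sapply(toy, is.numeric)  # named logical vector, one entry per column
num_cols

toy_num <- toy[ , num_cols]          # keep only the numeric columns
names(toy_num)

# plot(toy_num) would then draw the scatterplot matrix for these columns only
```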

As you can see there are lots of potential ways to do this. Many more than I will show here. Some require fewer lines, some are cleaner than others, some require more manual manipulation. How you do this is up to you, but I have a couple of general recommendations/observations:

  1. Don’t spend hours trying to create a one-line solution when you can spend 5 minutes on a three-line solution.
  2. Simpler is usually better.
  3. If it’s hard to read, it’s hard to troubleshoot.
  4. Comment, comment, comment.

What if I wanted to change the color or shape of the points?

plot(gapminder[,c(3,4,5,6)], col="blue") #col changes the outline color

plot(gapminder[,c(3,4,5,6)], col="blue", pch=16) #pch changes the shape

plot(gapminder[,c(3,4,5,6)], col="green", pch=0, cex=2) #cex changes the size

plot(gapminder[,c(3,4,5,6)], col="blue", bg="red", pch=22, cex=2) #bg changes the fill color

  • What do the plots tell you about the data?

What if I wanted to create a single scatter plot? In plot() try specifying a single variable for x and a single variable for y. You could add a second set of points to this existing plot using points().

plot(gapminder$year, gapminder$gdpPercap, pch=16)

points(gapminder$year, gapminder$gdpPercap,col="red", pch=0, cex=2)

Check out this source and this source and ?plot() for more info on plot options.

Let’s try making some more figures. What if I wanted to examine the distribution of the GDP per capita, life expectancy, and population variables? A histogram using hist() is a simple and quick way to do this.

hist(gapminder$gdpPercap)

hist(gapminder$lifeExp)

hist(gapminder$pop)

Let’s try modifying the histogram settings for the Population variable.

hist(gapminder$pop, breaks = 100, main="Histogram of Population", xlab = "Population", ylab= "Frequency")

hist(gapminder$pop, breaks = 100, main="Histogram of Population", xlab = "Population", ylab= "Frequency", col="red")

Our histograms suggest that life expectancy has a near-normal distribution, while the other variables certainly do not. Maybe they are log-normal? Let’s check…

hist(log(gapminder$pop, 10))

hist(log(gapminder$gdpPercap, 10))

Well, they certainly look more normal now. (Note that I’m using a log base 10 here. Check ?log() for information on how to use different log bases.) Hmmmm, so GDP per capita and population vary a lot more across countries and years (by orders of magnitude), while the distribution of life expectancy is relatively constrained.
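For reference, here is a short sketch of the different ways to ask for a particular log base in base R, using a few made-up values that are easy to check by hand:

```r
# Made-up values spanning several orders of magnitude
x <- c(10, 1000, 1e6)

base10_a <- log(x, 10)         # second argument is the base
base10_b <- log(x, base = 10)  # same thing, with the argument named
base10_c <- log10(x)           # dedicated base-10 function
base10_c

natural <- log(x)              # with no base given, log() is the natural log
natural

log2(8)                        # dedicated base-2 function

all.equal(base10_a, base10_c)  # the base-10 versions agree
```

On a base-10 log scale, each step of 1 corresponds to a tenfold change in the original variable, which is what "orders of magnitude" refers to.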

Now let’s see how GDP and life expectancy vary across countries and continents.

plot(gapminder$country, gapminder$gdpPercap)

plot(gapminder$continent, gapminder$gdpPercap)

plot(gapminder$country, gapminder$lifeExp)

plot(gapminder$continent, gapminder$lifeExp)

As you can see, the default plot type for categorical versus numerical data is not a scatter plot but a box-and-whisker plot. These plots show the median of the variable as a dark line and the interquartile range (where the middle 50% of the data falls, from the 25th to the 75th percentile) as the box. By default, the lines (“whiskers”) extend to the most extreme data points that lie within 1.5 times the interquartile range of the box, and the points beyond the whiskers are drawn individually as potential “outliers”.
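If you want to see the numbers behind a box plot, boxplot.stats() returns them without drawing anything. In this sketch the vector x is made up and includes one deliberately extreme value; by default the dark line is the median and the whiskers stop at the most extreme points within 1.5 times the box length.

```r
# Made-up values with one extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)

bs$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out      # points that would be drawn individually beyond the whiskers

median(x)   # matches the middle value of bs$stats
mean(x)     # pulled upward by the extreme value, so it differs from the median
```

Comparing median(x) with mean(x) here shows why the two should not be confused when reading a box plot.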

  • What do these plots tell you?

Next week we’ll start looking at ggplot which can be used to make some very nice looking plots.