In this tutorial you will practice extracting information on dataset structure, subsetting data using base R, running logic tests, and using base R functions to describe and summarize data.
Much of this material is adapted from Dr. Emily Burchfield’s fantastic tutorial on Data Management.
First, let’s install the gapminder package. Type install.packages("gapminder") in the console to pull the package from the CRAN repository onto your computer. The package documentation can be found here and more information about the Gapminder project can be found at www.gapminder.org.
Then load the package and the gapminder data.
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.1.2
data(gapminder)
Remember that library() loads the package into your current R session and data() pulls the pre-made gapminder dataset into your Global Environment.
Let’s inspect the new gapminder dataset:
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
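Our gapminder dataset includes the following variables: country, continent, year, lifeExp (life expectancy at birth, in years), pop (total population), and gdpPercap (GDP per capita, in inflation-adjusted US dollars).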
Before we start working with our data it is helpful to understand its data structure. What type of data object is this? It looks like a table, right? Let’s see what R says it is by using class().
class(gapminder)
## [1] "tbl_df" "tbl" "data.frame"
Notice that our R output says we have classes ‘tbl_df’, ‘tbl’ and ‘data.frame’. This is saying that the data object is a tibble (tbl_df), a special kind of data frame, so it can be treated as a data frame in most situations. If you check your environment window you’ll notice that under the “Type” column you have tbl_df. This just means that we have multiple variables (columns) with different data types and multiple records for each variable.
OK, but what if I’d like to know what class each variable is? (For example: Are countries listed as character strings or as a factor?)
Using base R functionality we can use a $ to access specific named columns in a dataframe or table. So we can access the country column by typing `gapminder$country`. Try checking the class of the country column.
class(gapminder$country)
## [1] "factor"
So the country column uses a Factor data class.
That means each country name is treated as a unique value or level (treated as nominal/categorical data that takes on a limited number of possible values). More information about using factors can be found here. Understanding factors can be important for some modeling where categorical or ordered variables are used. On a practical level this will become very useful as we start to group data using the dplyr package.
More fundamentally it is good to be aware of factors because if your numerical variables are stored as factors, operations on those variables will not work as intended. Indeed, transformation from factor to numerical data classes is not entirely straightforward. Converting directly from a factor to a numeric data class will not report any errors, but will not provide the expected result: you get the internal integer level codes rather than the original numbers. Instead you must convert from factor to character class and then convert from character to numeric. One way to do this is by typing as.numeric(as.character(x)), where x is your factor.
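To make the factor-to-numeric pitfall concrete, here is a minimal sketch on a made-up factor (the object f and its values are hypothetical, not part of gapminder):

```r
# Toy factor whose labels happen to look like numbers (hypothetical data)
f <- factor(c("10", "20", "20", "5"))

# Direct conversion silently returns the internal level codes
as.numeric(f)                 # 1 2 2 3

# Going through character first recovers the original numbers
as.numeric(as.character(f))   # 10 20 20 5
```

The levels are sorted as text ("10", "20", "5"), which is why the value 5 ends up with code 3.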
Our categorical variables, or variables that take on a limited number of possible values, are coded as "factor" variables.
Now if we want to check the class for each column we could repeat this for each variable by typing class(gapminder$variable_name) for each variable, or we could use the sapply() function to apply the class() function across all variables in the gapminder dataset. Remember, if you’re unfamiliar with any function, e.g. sapply(), you can ask R how it works using ?sapply().
?sapply()
## starting httpd help server ... done
sapply(gapminder, class)
## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
You should see that the year column is stored as integers, the lifeExp column is stored as numeric, etc.
Another function for investigating data structure will actually tell us about the data type of the entire object and the data class of each column in one go. Try using the str() function.
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Notice how each column is reported with a $? We’ll use this notation to access specific columns in our datasets.
As shown above, we can extract information for each variable in the dataset using $. For example, if we wanted to determine the range of years in the dataset we can simply type:
range(gapminder$year)
## [1] 1952 2007
So the data runs from 1952 to 2007. Is there data for every year over this period? We can check to see what unique values of year are in the dataset.
unique(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Looks like this dataset only has observations every five years. Good to know!
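If you would rather have R confirm the spacing than eyeball it, one option is diff(). This is a small sketch that retypes the years from the output above into a stand-alone vector called years (a name chosen here for illustration); with the package loaded you could use sort(unique(gapminder$year)) instead.

```r
# Years copied from the unique() output above
years <- c(1952, 1957, 1962, 1967, 1972, 1977,
           1982, 1987, 1992, 1997, 2002, 2007)

# Gap between each pair of consecutive years
diff(years)

# A single unique gap means the years are evenly spaced
unique(diff(years))   # 5
```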
I’ll refer to base R a lot in this class. When I say this, I’m referring to the functions that come in the basic R language and that are not associated with a particular package that you have to load in using library(). The functions we’ve used so far like head() and unique() are base R functions. In the next tutorial, we’ll start working with functions from the dplyr package that do not come from base R, including select(), filter() and arrange().
In general in the tutorials, when I refer to packages, I will just list the package name in this font. For example, I may refer to ggplot2, dplyr, or gapminder.
I will refer to functions by including closed parentheses (), e.g. select(), filter(), etc. This is a reminder that functions almost always include arguments which you have to include, e.g. mean(x) tells R to compute the mean of the variable x.
Ok, now that we’re up to speed on notation and terms, let’s start playing with our data.
Let’s start by seeing what would happen if we used str() on a vector instead of a dataframe. First store one variable from gapminder in a new object, then check its structure.
le <- gapminder$lifeExp
str(le)
## num [1:1704] 28.8 30.3 32 34 36.1 ...
You should see a single data class name (e.g. “num” for numeric) followed by [1:1704]. The square brackets [] denote indexing. Here, because we have a single vector and we are reporting the entire vector we see 1:1704. (Remember my warning that : shows up all over in R.) This is basically saying that we are looking at the 1st through the 1704th items in the vector (which is all items in the vector). Not sure about this? You can double-check the number of records in this vector using length().
Want to pull out a specific single value from the vector? We can do that using the [] notation as in:
le[52]
## [1] 65.634
This simply returns the 52nd value in the vector, which just happens to be 65.634.
We could also return a subset of the vector using a combination of our [] and : notation. For example we could pull the last 5 values in our vector.
le_sub <- le[1700:1704]
le_sub
## [1] 62.351 60.377 46.809 39.989 43.487
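Hard-coding 1700:1704 only works if you already know the vector has 1704 items. As a hedged alternative, the sketch below uses a short made-up vector v to show two ways of getting the last five values without knowing the length in advance.

```r
# Hypothetical vector standing in for a longer one such as le
v <- c(3.2, 4.8, 1.5, 9.9, 7.1, 6.4, 2.0)
n <- length(v)

# Build the index range from the length.
# The parentheses matter: n - 4:n would be read as n - (4:n).
v[(n - 4):n]

# tail() does the same thing in one step
tail(v, 5)
```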
Now let’s try indexing on a data frame instead of a vector. With the vector we had one range within the [], but for a dataframe we will have two ranges: [x,y].
In this case we have indexing for our rows/records in the x slot and indexing for our columns/variables in the y slot. (Some functions will include an argument asking along which dimension you would like to calculate something. In this case x is the 1st dimension and y is the 2nd dimension.)
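As one illustration of a function that asks for a dimension, apply() takes a MARGIN argument where 1 means rows and 2 means columns. The small matrix m below is made up for this sketch.

```r
# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)
dim(m)             # 2 3 (rows first, then columns)

# MARGIN = 1: apply the function across each row
apply(m, 1, sum)   # 9 12

# MARGIN = 2: apply the function down each column
apply(m, 2, sum)   # 3 7 11
```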
We could extract all variables (columns) for a single record (row).
df_sub <- gapminder[52, ]
df_sub
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Argentina Americas 1967 65.6 22934225 8053.
This code says we want to pull information from the 52nd row of the gapminder dataframe, and that R should return all information from all columns (because the y slot is blank).
You can also easily extract a single value or information for just a few columns and rows using [], :, and c().
df_sub2 <- gapminder[1700:1704, 1:3]
df_sub2
## # A tibble: 5 x 3
## country continent year
## <fct> <fct> <int>
## 1 Zimbabwe Africa 1987
## 2 Zimbabwe Africa 1992
## 3 Zimbabwe Africa 1997
## 4 Zimbabwe Africa 2002
## 5 Zimbabwe Africa 2007
df_sub3 <- gapminder[1700:1704, c(1,6)]
df_sub3
## # A tibble: 5 x 2
## country gdpPercap
## <fct> <dbl>
## 1 Zimbabwe 706.
## 2 Zimbabwe 693.
## 3 Zimbabwe 792.
## 4 Zimbabwe 672.
## 5 Zimbabwe 470.
df_value <- gapminder[52, 4]
df_value
## # A tibble: 1 x 1
## lifeExp
## <dbl>
## 1 65.6
But it’s kinda hard to extract what we want using only index values. It’s easier if we can refer to actual values. Let’s say you want to extract observations for the country of Sri Lanka. We can use base R indexing and a simple logic statement to subset the full dataset:
sri_lanka <- gapminder[gapminder$country == "Sri Lanka", ]
head(sri_lanka)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sri Lanka Asia 1952 57.6 7982342 1084.
## 2 Sri Lanka Asia 1957 61.5 9128546 1073.
## 3 Sri Lanka Asia 1962 62.2 10421936 1074.
## 4 Sri Lanka Asia 1967 64.3 11737396 1136.
## 5 Sri Lanka Asia 1972 65.0 13016733 1213.
## 6 Sri Lanka Asia 1977 65.9 14116836 1349.
This tells R that you want to find ALL rows (first item) in which country == "Sri Lanka" (country is Sri Lanka), and to include ALL columns (second item) in the dataset.
Notice the double equal signs == that we use for equality testing. (For example, if we create a variable x that is set equal to 5 and want to confirm that x is, in fact, equal to 5, we could type x==5. The console should return TRUE. Try it!)
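Here is that equality test written out, followed by the same test applied to a whole vector, which is what makes the row subsetting above work. The vector places is invented for this sketch.

```r
x <- 5
x == 5    # TRUE
x == 6    # FALSE

# On a vector, == tests every element and returns one TRUE/FALSE per element
places <- c("Sri Lanka", "Nepal", "Sri Lanka")
places == "Sri Lanka"        # TRUE FALSE TRUE

# TRUE counts as 1, so sum() gives the number of matches
sum(places == "Sri Lanka")   # 2
```

A logical vector like this, placed in the row slot of [ , ], keeps only the rows where the test is TRUE.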
Don’t forget the , after "Sri Lanka"! If you don’t include the comma you will get an error about invalid subscripts because you have not specified both an x and y location for a dataset with two dimensions. If you only wanted the first column, then you could type gapminder[gapminder$country == "Sri Lanka", 1]. So we can subset by row or column index, or by row or column values.
Back to the indexing. If we didn’t want ALL of the columns and only wanted the variable gdpPercap for Sri Lanka, we could do the following:
sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", "gdpPercap"]
head(sri_lanka_gdp)
## # A tibble: 6 x 1
## gdpPercap
## <dbl>
## 1 1084.
## 2 1073.
## 3 1074.
## 4 1136.
## 5 1213.
## 6 1349.
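Since we only asked for a single variable, this returns a single column listing all observations of Sri Lankan GDP per capita over the years included in the dataset. (Because gapminder is a tibble, the result is a one-column tibble, as the output shows; a plain data frame would simplify it to a vector.) This isn’t very useful because we don’t know what years are associated with each observation. Let’s pull out yearly data too.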
sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", c("year", "gdpPercap")]
head(sri_lanka_gdp)
## # A tibble: 6 x 2
## year gdpPercap
## <int> <dbl>
## 1 1952 1084.
## 2 1957 1073.
## 3 1962 1074.
## 4 1967 1136.
## 5 1972 1213.
## 6 1977 1349.
Here we create a vector of column names we want to pull out of the dataset using c(), which combines values into a vector or list much like we did last week. Note that you can use either a vector of column names or a vector of numerical indexes (e.g. use c(1,3,5) to select the first, third, and fifth columns, or use c(1:5) to select columns one through five).
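The sketch below repeats these column selections on a plain base R data frame, since the result of a single-column selection differs from what a tibble like gapminder returns. The data frame df and its columns are made up for illustration.

```r
# Hypothetical plain data.frame (not a tibble)
df <- data.frame(id = 1:3,
                 score = c(2.5, 3.0, 4.5),
                 grp = c("a", "b", "a"))

# Selecting by name or by position gives the same columns
df[ , c("id", "score")]
df[ , c(1, 2)]

# One column from a plain data.frame is simplified to a vector...
df[df$grp == "a", "score"]

# ...unless you ask R to keep the data frame structure
df[df$grp == "a", "score", drop = FALSE]
```

Tibbles never simplify in this way, which is why the gapminder examples above print as one-column tibbles.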
Now let’s see if you can calculate the average, maximum, and minimum GDP per capita for Sri Lanka across the years in the dataset (1952 to 2007).
mean(sri_lanka_gdp$gdpPercap, na.rm=TRUE)
## [1] 1854.731
max(sri_lanka_gdp$gdpPercap, na.rm=TRUE)
## [1] 3970.095
min(sri_lanka_gdp$gdpPercap, na.rm=TRUE)
## [1] 1072.547
range(sri_lanka_gdp$gdpPercap, na.rm=TRUE) # this will give us both the min and max
## [1] 1072.547 3970.095
Ok, what if I want to know the average GDP per capita for all countries or for all years? aggregate() can help with that. If you’re confused check out the aggregate() help info by typing ?aggregate() into your console.
gdp_country <- aggregate(gapminder[ ,"gdpPercap"], by = gapminder["country"], FUN=mean, na.rm=TRUE) #na.rm=TRUE is passed on to mean()
head(gdp_country)
## country gdpPercap
## 1 Afghanistan 802.6746
## 2 Albania 3255.3666
## 3 Algeria 4426.0260
## 4 Angola 3607.1005
## 5 Argentina 8955.5538
## 6 Australia 19980.5956
gdp_time <- aggregate(gapminder[ ,"gdpPercap"], by = gapminder["year"], FUN=mean, na.rm=TRUE) #na.rm=TRUE is passed on to mean()
head(gdp_time)
## year gdpPercap
## 1 1952 3725.276
## 2 1957 4299.408
## 3 1962 4725.812
## 4 1967 5483.653
## 5 1972 6770.083
## 6 1977 7313.166
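aggregate() also accepts a formula, which some people find easier to read than the by = form used above. This is a minimal sketch on a made-up data frame (toy, group and value are hypothetical names), not a replacement for the calls above.

```r
# Hypothetical data: two groups with a numeric value each
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Read as: summarize value by group, using the mean
by_group <- aggregate(value ~ group, data = toy, FUN = mean)
by_group
```

With the formula form, rows containing missing values are dropped by default (na.action = na.omit), so na.rm is usually not needed there.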
Want to see how countries rank by GDP? Let’s try order() for our dataframe. (Use sort() for vectors.)
gdp_order <- gdp_country[order(gdp_country$gdpPercap), ]
head(gdp_order)
## country gdpPercap
## 88 Myanmar 439.3333
## 18 Burundi 471.6630
## 43 Ethiopia 509.1152
## 42 Eritrea 541.0025
## 87 Mozambique 542.2783
## 78 Malawi 575.4472
This lists the countries by mean GDP per capita in ascending order. What about descending order?
gdp_desc<-gdp_country[order(-gdp_country$gdpPercap), ] # or we can use gdp_country[order(gdp_country$gdpPercap, decreasing=TRUE),]
head(gdp_desc)
## country gdpPercap
## 72 Kuwait 65332.91
## 124 Switzerland 27074.33
## 96 Norway 26747.31
## 135 United States 26261.15
## 21 Canada 22410.75
## 91 Netherlands 21748.85
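The difference between sort() and order() is easy to miss: sort() returns the sorted values, while order() returns the positions that would sort them, which is what you need inside []. A small sketch on an invented data frame called scores:

```r
# Hypothetical scores, with a tie at 10 points
scores <- data.frame(name = c("d", "a", "c", "b"),
                     pts = c(10, 30, 10, 20),
                     stringsAsFactors = FALSE)

sort(scores$pts)     # the values themselves: 10 10 20 30
order(scores$pts)    # the row positions: 1 3 4 2

# Descending by pts, with ties broken alphabetically by name
scores[order(-scores$pts, scores$name), ]
```

Passing a second column to order() is one way to get a predictable ranking when two rows share the same value.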
What if we want to add a new variable to our dataset, say an indicator of whether or not a country is located in the continent Africa? We can add a named column to our dataframe using the $ and <- operators.
africa<- gapminder #assign the original gapminder dataset to a new object for further manipulation
africa$africa <- ifelse(gapminder$continent == "Africa", 1, 0) #create a new column named "africa"
head(africa)
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap africa
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 0
## 2 Afghanistan Asia 1957 30.3 9240934 821. 0
## 3 Afghanistan Asia 1962 32.0 10267083 853. 0
## 4 Afghanistan Asia 1967 34.0 11537966 836. 0
## 5 Afghanistan Asia 1972 36.1 13079460 740. 0
## 6 Afghanistan Asia 1977 38.4 14880372 786. 0
head(africa[africa$africa == 1,]) #confirm that locations with continent == Africa have an africa value of 1
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap africa
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Algeria Africa 1952 43.1 9279525 2449. 1
## 2 Algeria Africa 1957 45.7 10270856 3014. 1
## 3 Algeria Africa 1962 48.3 11000948 2551. 1
## 4 Algeria Africa 1967 51.4 12760499 3247. 1
## 5 Algeria Africa 1972 54.5 14760787 4183. 1
## 6 Algeria Africa 1977 58.0 17152804 4910. 1
#And another way to do this
africa$africa2 <- 0
africa$africa2[africa$continent == "Africa"] <- 1
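A third option, sketched here on a short made-up vector rather than the full dataset, is to convert the logical test directly to 0/1 with as.integer(). The names continent, flag_ifelse and flag_int exist only in this example.

```r
# Hypothetical continent labels
continent <- c("Asia", "Africa", "Europe", "Africa")

# The ifelse() approach used above
flag_ifelse <- ifelse(continent == "Africa", 1, 0)

# TRUE/FALSE converts directly to 1/0
flag_int <- as.integer(continent == "Africa")

# Both approaches flag the same elements
all(flag_ifelse == flag_int)

# Cross-tabulate as a quick check that only Africa gets a 1
table(continent, flag_int)
```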
We’ve now covered some of the basics of indexing and subsetting. I recommend these tutorials on data subsetting and data manipulation to help solidify and extend your understanding. We will introduce data manipulation using dplyr next week, but will still make use of base R frequently.
For now we’ll take a look at using base R plot() and hist(). Next week we’ll start using ggplot2 to make figures.
Let’s start with the quickest and simplest overall visualization of the entire gapminder dataset.
plot(gapminder)
This produces a matrix of scatterplots where one variable is along the y-axis and another is along the x-axis.
Recall that some of our variables are factor class (or categorical data). Plotting them along x-y axes doesn’t really make much sense.
Let’s try plotting this again, but only for the numeric variables by subsetting within plot().
# manually specify --> totally fine for small datasets, not so great when you have many variables or for scaling up/automating
plot(gapminder[,c("year","lifeExp","pop","gdpPercap")])
plot(gapminder[,c(3,4,5,6)])
# one way to do this without manually specifying
classes <- sapply(gapminder, class)
classes
## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
isnum <- classes[classes != "factor"] #subset of classes vector with only columns that are not factors
isnum
## year lifeExp pop gdpPercap
## "integer" "numeric" "integer" "numeric"
plot(gapminder[ , names(gapminder) %in% names(isnum)]) #plot for the subset where column names in gapminder match the column names in the isnum vector
# another way to subset and plot using the which function
isnum_col <- which(classes != "factor") #return a vector of the indices where class is not factor
isnum_col <- which(sapply(gapminder, class) != "factor") #same as above, but we use only 1 line instead of 2
plot(gapminder[,isnum_col]) #plot for the subset of gapminder where column indices are in isnum_col
plot(gapminder[ ,which(sapply(gapminder, class) != "factor")]) #same plot as above, with the column selection done inline
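Yet another way to pick out the numeric columns, shown here as a sketch on a made-up data frame, is to test each column with is.numeric() instead of comparing class names. The names toy, num_cols and toy_num are hypothetical.

```r
# Hypothetical data frame with one factor and two numeric columns
toy <- data.frame(
  site = factor(c("a", "b", "c")),
  year = c(2001L, 2002L, 2003L),
  temp = c(11.2, 12.8, 12.1)
)

# One TRUE/FALSE per column: is this column numeric?
num_cols <- sapply(toy, is.numeric)
num_cols

# Use the logical vector in the column slot to keep only numeric columns
toy_num <- toy[ , num_cols]
names(toy_num)   # "year" "temp"
```

Unlike the != "factor" test, this also excludes character columns, which would otherwise slip through.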
As you can see there are lots of potential ways to do this. Many more than I will show here. Some require fewer lines, some are cleaner than others, some require more manual manipulation. How you do this is up to you, but I have a couple of general recommendations/observations:
What if I wanted to change the color or shape of the points?
plot(gapminder[,c(3,4,5,6)], col="blue") #col changes the point (outline) color
plot(gapminder[,c(3,4,5,6)], col="blue", pch=16)#pch changes the shape
plot(gapminder[,c(3,4,5,6)], col="green", pch=0, cex=2) #cex changes the size
plot(gapminder[,c(3,4,5,6)], col="blue", bg="red", pch=22, cex=2) #bg changes the fill color
What if I wanted to create a single scatter plot? In plot() try specifying a single variable for x and a single variable for y. You could add a second set of points to this existing plot using points().
plot(gapminder$year, gapminder$gdpPercap, pch=16)
points(gapminder$year, gapminder$gdpPercap,col="red", pch=0, cex=2)
Check out this source and this source and ?plot() for more info on plot options.
Let’s try making some more figures. What if I wanted to examine the distribution of the GDP, life expectancy, and population variables? A histogram using hist() is a simple and quick way to do this.
hist(gapminder$gdpPercap)
hist(gapminder$lifeExp)
hist(gapminder$pop)
Let’s try modifying the histogram settings for the Population variable.
hist(gapminder$pop, breaks = 100, main="Histogram of Population", xlab = "Population", ylab= "Frequency")
hist(gapminder$pop, breaks = 100, main="Histogram of Population", xlab = "Population", ylab= "Frequency", col="red")
Our histograms suggest that life expectancy has a near-normal distribution, while the other variables certainly do not. Maybe they are log-normal? Let’s check…
hist(log(gapminder$pop, 10))
hist(log(gapminder$gdpPercap, 10))
Well, they certainly look more normal now. (Note that I’m using a log base 10 here. Check ?log() for information on how to use different log bases.) Hmmmm, so GDP per capita and population vary much more widely across countries and years (by orders of magnitude), while the distribution of life expectancy is relatively constrained.
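For reference, here is a short sketch of the common ways to set the log base in base R, using a made-up vector x of powers of ten.

```r
# Hypothetical values spanning several orders of magnitude
x <- c(1, 10, 100, 1000)

log(x, 10)         # second argument sets the base
log10(x)           # dedicated base-10 function, same result
log(x, base = 2)   # any other base can be named explicitly
log(x)             # default is the natural log (base e)

# Confirm that the two base-10 forms agree
all.equal(log(x, 10), log10(x))
```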
Now let’s see how GDP and life expectancy vary across countries and continents.
plot(gapminder$country, gapminder$gdpPercap)
plot(gapminder$continent, gapminder$gdpPercap)
plot(gapminder$country, gapminder$lifeExp)
plot(gapminder$continent, gapminder$lifeExp)
As you can see the default plot type for categorical - numerical data plots is not a scatter plot, but instead a box-and-whisker plot. These plots show the median of each group as a dark line and the interquartile range (where the middle 50% of the data falls, from the 25th to the 75th percentile) as the box. By default the lines (“whiskers”) extend to the most extreme data points that lie within 1.5 times the interquartile range of the box, and any points beyond the whiskers are drawn individually as “outliers”.
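To see the numbers behind a box-and-whisker plot, you can ask boxplot.stats() for the values that would be drawn. This sketch uses a small made-up vector vals with one unusually large value.

```r
# Hypothetical data with one extreme value
vals <- c(2, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(vals)

# Lower whisker, lower edge of box, median, upper edge of box, upper whisker
bs$stats   # 2.0 4.5 6.5 8.5 9.0

# Points beyond the whiskers, drawn individually
bs$out     # 30

# The dark line is the median, which the extreme value barely moves
median(vals)   # 6.5
mean(vals)     # 8.875
```

Note that the upper whisker stops at 9, the largest value within 1.5 times the box height above the box, rather than reaching to the maximum of 30.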
Next week we’ll start looking at ggplot2, which can be used to make some very nice looking plots.