Categorical data. Can have ordered and unordered categories.
factor(c("low", "high", "medium"))
is.factor()
The check function above would return TRUE or FALSE.
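To make the ordered/unordered distinction concrete, here is a minimal sketch. The variable names `sizes` and `sizes_ord` are made up for illustration; the functions are all base R.

```r
# Unordered factor: levels default to alphabetical order
sizes <- factor(c("low", "high", "medium"))
levels(sizes)       # "high" "low" "medium"
is.factor(sizes)    # TRUE
is.ordered(sizes)   # FALSE

# Ordered factor: levels are listed explicitly, smallest first
sizes_ord <- factor(c("low", "high", "medium"),
                    levels = c("low", "medium", "high"),
                    ordered = TRUE)
is.ordered(sizes_ord)         # TRUE
sizes_ord[1] < sizes_ord[2]   # TRUE, because "low" < "high"
```

Comparisons such as `<` are only meaningful for ordered factors; on an unordered factor R returns NA with a warning.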
Several other functions help examine an object's data type. The most frequently used are class() and length().
| Function | Description | Example Usage | Sample Output |
|---|---|---|---|
| typeof() | Returns the internal data type of the object (e.g., integer, double, character). | typeof(2.5) | "double" |
| class() | Retrieves the class or type of the object, indicating how R should handle it. | class(data.frame()) | "data.frame" |
| storage.mode() | Provides the mode of how the object is stored internally (often similar to typeof()). | storage.mode(2L) | "integer" |
| length() | Gives the count of elements in an object. For matrices, it’s the product of rows and columns. | length(c(1,2,3)) | 3 |
| attributes() | Shows any metadata attributes associated with the object (e.g., names, dimensions). | attributes(matrix(1:4, ncol=2)) | List with $dim attribute |
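The difference between typeof() and class() is easiest to see on objects where the two disagree. The sketch below uses two throwaway objects, `x` and `m`, that are not part of the lecture script.

```r
# A factor is stored as integer codes but handled as class "factor"
x <- factor(c("a", "b"))
typeof(x)   # "integer"
class(x)    # "factor"

# A matrix is an integer vector with a dim attribute
m <- matrix(1:4, ncol = 2)
typeof(m)           # "integer"
class(m)            # includes "matrix" (and also "array" in R 4.0 or later)
length(m)           # 4, the product of rows and columns
attributes(m)$dim   # 2 2
```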
2.1 In-Class Exercise: Data Type
Use class(), length(), and the is.XXX() functions to examine the data types.
Copy the following code and run it in the Data Type section of the R script.
Code
# For 5.2
print(paste("Class of 5.2:", class(5.2)))
print(paste("Length of 5.2:", length(5.2)))
print(paste("Is it a numeric? ", is.numeric(5.2)))

# For 3L
print(paste("Class of 3L:", class(3L)))
print(paste("Length of 3L:", length(3L)))
print(paste("Is it an integer? ", is.integer(3L)))

# For "Hello, R!"
print(paste("Class of 'Hello, R!':", class("Hello, R!")))
print(paste("Length of 'Hello, R!':", length("Hello, R!")))
print(paste("Is it a character? ", is.character("Hello, R!")))

# For TRUE
print(paste("Class of TRUE:", class(TRUE)))
print(paste("Length of TRUE:", length(TRUE)))
print(paste("Is it a logical? ", is.logical(TRUE)))

# For FALSE
print(paste("Class of FALSE:", class(FALSE)))
print(paste("Length of FALSE:", length(FALSE)))
print(paste("Is it a logical? ", is.logical(FALSE)))

# For 2L
print(paste("Class of 2L:", class(2L)))
print(paste("Length of 2L:", length(2L)))
print(paste("Is it an integer? ", is.integer(2L)))

# For 100L
print(paste("Class of 100L:", class(100L)))
print(paste("Length of 100L:", length(100L)))
print(paste("Is it an integer? ", is.integer(100L)))

# For 3 + 2i
print(paste("Class of 3 + 2i:", class(3 + 2i)))
print(paste("Length of 3 + 2i:", length(3 + 2i)))
print(paste("Is it a complex? ", is.complex(3 + 2i)))

# For charToRaw("Hello")
raw_value <- charToRaw("Hello")
print(paste("Class of charToRaw('Hello'):", class(raw_value)))
print(paste("Length of charToRaw('Hello'):", length(raw_value)))
print(paste("Is it raw? ", is.raw(raw_value)))

# For factor(c("low", "high", "medium"))
factor_value <- factor(c("low", "high", "medium"))
print(paste("Class of the factor:", class(factor_value)))
print(paste("Length of the factor:", length(factor_value)))
print(paste("Is it a factor? ", is.factor(factor_value)))
3 Data Structure
In general R handles the following data structures:
| Data Structure | Description | Creation Function | Example |
|---|---|---|---|
| Vector | Holds elements of the same type. | c() | c(1, 2, 3, 4) |
| Matrix | Two-dimensional; elements of the same type. | matrix() | matrix(1:4, ncol=2) |
| Array | Multi-dimensional; elements of the same type. | array() | |
| List | Can hold elements of different types. | list() | list(name="John", age=30, scores=c(85, 90, 92)) |
| Data Frame | Like a table; columns can be different types. | data.frame() | data.frame(name=c("John", "Jane"), age=c(30, 25)) |
| Factor | For categorical data. | factor() | factor(c("male", "female", "male")) |
| Tibble | Part of tidyverse; improved data frame. | tibble(), as_tibble() | |
| Time Series | Used for time series data. | ts() | |
Note: There are multiple ways to create each data structure; the table above shows only one example of each.
In this class, we are going to focus on four data structures:
Vector
Matrix
List
Data Frame
3.1 Access to the value
| Data Structure | Description | Example | Result Description |
|---|---|---|---|
| Vector | Accessing values by index | v <- c(10, 20, 30, 40); v[2] | Gets the second element: 20 |
| Matrix | Accessing rows and columns using indices | m <- matrix(1:4, 2, 2); m[1,2] | Gets the value in the 1st row, 2nd column: 3 |
| Data Frame | Accessing columns by name and rows by index | df <- data.frame(x=1:3, y=4:6); df$x | Gets the x column: 1, 2, 3 |
| | | df[1, ] | Gets the first row as a data frame |
| List | Accessing elements by index or name | lst <- list(a=1, b=2, c=3); lst$a | Gets the a element: 1 |
| | | lst[[2]] | Gets the second element: 2 |
| Array | Accessing elements using indices in each dimension | arr <- array(1:8, dim=c(2,2,2)); arr[1,2,2] | Gets the value at row 1, column 2, layer 2: 7 |
| Factor | Accessing levels and values | f <- factor(c("low", "high", "medium")); levels(f) | Gets the levels of the factor: "high", "low", "medium" |
3.2 Vectors
3.2.1 Exercise: Vectors
Part (a) is for in-class use. Part (b) and Challenge Task are for your own practice at home.
Please do the exercise first and then check the solution.
Question 1: Creating Vectors
(a) In-class: Create a numeric vector named ages that contains the ages of five friends: 21, 23, 25, 27, and 29.
(b) Take-home: Create a character vector named colors with the values: “red”, “blue”, “green”, “yellow”, and “purple”.
Question 2: Accessing Vector Elements
(a) In-class: Print the age of the third friend from the ages vector.
(b) Take-home: Print the last color in the colors vector without using its numeric index.
Question 3: Vector Operations
(a) In-class: Add 2 years to each age in the ages vector.
(b) Take-home: Combine the ages and colors vectors into a single vector named combined. Print this new vector.
Question 4: Vector Filtering
(a) In-class: From the ages vector, filter and print ages less than 27.
(b) Take-home: From the colors vector, find and print colors that have the letter “e” in them.
Challenge Task!
(a) In-class: Reverse the order of the colors vector. (Hint: Think about how you might use the seq() function or indexing.)
(b) Take-home: Using a loop (advanced), print each color from the colors vector with a statement: “My favorite color is [color]”. (Replace [color] with the actual color from the vector.)
# (b)
for (color in colors) {
  cat("My favorite color is", color, "\n")
}
My favorite color is red
My favorite color is blue
My favorite color is green
My favorite color is yellow
My favorite color is purple
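For the remaining questions, one possible set of answers is sketched below. It is only one approach among several. The names `ages_plus_2` and `combined_class` are chosen here for illustration, and the result of adding 2 is stored in a new vector so that `ages` stays unchanged for Question 4.

```r
# Question 1: create the two vectors
ages <- c(21, 23, 25, 27, 29)
colors <- c("red", "blue", "green", "yellow", "purple")

# Question 2: third age; last color without hard-coding its position
ages[3]           # 25
tail(colors, 1)   # "purple"

# Question 3: arithmetic is vectorised; combining numbers and strings
# coerces every element to character
ages_plus_2 <- ages + 2        # 23 25 27 29 31
combined <- c(ages, colors)
combined_class <- class(combined)   # "character"

# Question 4: filter with a logical condition
ages[ages < 27]               # 21 23 25
colors[grepl("e", colors)]    # every color in this vector contains an "e"

# Challenge (a): reverse by indexing from the last position to the first
colors[length(colors):1]      # same result as rev(colors)
```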
3.3 Matrix
For the matrix data structure, you need to know three things:
How to create a new Matrix
How to do Matrix/Linear Algebra Operations
How to access a specific cell of a matrix
Review the Appendix: Matrix Operations for more information
3.3.1 How to create a new Matrix
To create a matrix in R, you can use the matrix() function. Here’s an example:
Code
# Create a matrix from a vector with 3 rows and 2 columns
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)
my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
In the above example, the sequence 1:6 generates a vector containing numbers from 1 to 6. This is then used to fill a matrix with 3 rows and 2 columns.
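As the output shows, matrix() fills column by column by default. The sketch below contrasts that with row-wise filling via the `byrow` argument and shows how to pick out a cell, a column, or a row. The name `by_row` is introduced here only for illustration.

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

# Fill row by row instead of column by column
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
by_row
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6

# Access a specific cell: [row, column]
my_matrix[2, 1]   # 2

# Leave an index empty to take a whole column or row
my_matrix[, 2]    # 4 5 6
my_matrix[3, ]    # 3 6

dim(my_matrix)    # 3 2
```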
3.3.2 How to do Matrix/Linear Algebra Operations
Matrix operations in R are straightforward. You can use common arithmetic operations (+, -, *, /) for element-wise operations or use specific functions for matrix-specific operations.
For instance, matrix multiplication (a common operation in linear algebra) can be done using the %*% operator:
Code
# Create two matrices
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(2, 0, 1, 3), nrow = 2)
# Matrix multiplication
result <- A %*% B
result
     [,1] [,2]
[1,]    2   10
[2,]    4   14
Note on Matrix Operators:
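The * operator multiplies two matrices element by element, while %*% computes the matrix product. Unlike element-wise multiplication, the matrix product depends on the order of its operands.
Matrix product of matX and matY versus matY and matX:
Code
matX %*% matY
     [,1] [,2]
[1,]    8   20
[2,]    5   13
Code
matY %*% matX
     [,1] [,2]
[1,]   13    5
[2,]   20    8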
As these results are not equal, the matrix product is not commutative for matX and matY.
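The contrast with element-wise multiplication can be checked on any pair of small matrices. The sketch below uses two made-up matrices, `P` and `Q`; they are not the matX and matY from the output above.

```r
# Two illustrative 2x2 matrices (filled column by column)
P <- matrix(c(1, 2, 3, 4), nrow = 2)   # rows: (1, 3) and (2, 4)
Q <- matrix(c(0, 1, 1, 0), nrow = 2)   # rows: (0, 1) and (1, 0)

# Element-wise product: each cell of P times the matching cell of Q
P * Q
#      [,1] [,2]
# [1,]    0    3
# [2,]    2    0

# Matrix product: rows of P combined with columns of Q
P %*% Q
#      [,1] [,2]
# [1,]    3    1
# [2,]    4    2

# Swapping the operands changes the matrix product ...
Q %*% P
#      [,1] [,2]
# [1,]    2    4
# [2,]    1    3

# ... but never the element-wise product
identical(P * Q, Q * P)   # TRUE
```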
3.4 List
Lists in R are a type of data structure that allow you to store elements of different types (e.g., numbers, strings, vectors, and even other lists). Here’s a comprehensive tutorial on using lists in R:
3.4.1 How to create a new List
To create a list in R, you can use the list() function. Here’s how:
Code
# Create a list containing a number, a character string, and a vector
my_list <- list(age = 25, name = "John", scores = c(85, 90, 95))
my_list
$age
[1] 25

$name
[1] "John"

$scores
[1] 85 90 95
The above code creates a list my_list with three elements: an age, a name, and a vector of scores.
3.4.2 How to modify and add elements to a List
You can modify an existing list element or add a new element by using either the [[ ]] operator or the $ operator.
Code
# Modify the age
my_list[[1]] <- 26
# Add a new element, address
my_list$address <- "123 R Street"
my_list
In this example, the age is modified, and a new element address is added to the list.
3.4.3 How to access elements in a List
To access elements in a list, you can use either the [[ ]] operator or the $ operator:
Code
# Access the name using double square brackets
person_name <- my_list[[2]]
# Access scores using the dollar sign
test_scores <- my_list$scores
person_name
[1] "John"
Code
test_scores
[1] 85 90 95
Here, the second element of my_list (name) is accessed using [[ ]], and the scores are accessed using $.
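A common stumbling block is the difference between single and double square brackets: [ ] returns a smaller list, while [[ ]] returns the element itself. The sketch below rebuilds the list from Section 3.4.1 so that it runs on its own; `sub_list` and `scores` are illustrative names.

```r
my_list <- list(age = 25, name = "John", scores = c(85, 90, 95))

# Single brackets keep the list wrapper
sub_list <- my_list["scores"]
class(sub_list)    # "list"
length(sub_list)   # 1

# Double brackets (or $) extract the element itself
scores <- my_list[["scores"]]
class(scores)      # "numeric"
mean(scores)       # 90

# Indexing can be chained to reach inside the element
my_list[["scores"]][2]   # 90
```

Because mean() expects a numeric vector, it works on `scores` but not on `sub_list`.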
3.4.4 How to remove elements from a List
You can remove an element from a list by setting it to NULL:
Code
# Remove the address element
my_list$address <- NULL
my_list
$age
[1] 26

$name
[1] "John"

$scores
[1] 85 90 95
The element address is removed from the list in this example.
Remember, lists are versatile and can hold heterogeneous data, making them crucial for various applications in R, especially when you need to organize and structure diverse data types.
Question 1: Creating and Modifying Lists
(a) In-class:
Create a list named student_info containing the following elements:
name: “Alice”
age: 20
subjects: a vector with “Math”, “History”, “Biology”
Display the created list.
(b) Take-home:
Add two more elements to the student_info list:
grades: a vector with scores 90, 85, 88 corresponding to the subjects.
address: “123 Main St”
Display the updated list.
Question 2: Accessing and Analyzing List Elements
(a) In-class:
From the student_info list, extract and print:
The name of the student.
The subjects they are studying.
(b) Take-home:
Calculate and display:
The average grade of the student using the grades element from the list.
The number of subjects the student is studying.
Question 3: Nested Lists
(a) In-class:
Create a nested list named school_info with the following structure:
school_name: “Greenwood High”
students: a list containing two elements:
student1: the student_info list you created in Question 1.
student2: a new list with name as “Bob”, age as 22, and subjects with “Physics”, “Math”, “English”.
Display the created nested list.
(b) Take-home:
Add a new student, student3, to the students list in school_info with your own details. Display the updated school_info list.
Challenging Question:
Given the school_info list:
Write a function named get_average_grade that takes in the school_info list and a student name as arguments. The function should return the average grade for the given student. If the student does not exist in the list or has no grades, return an appropriate message. Test your function with student1, student2, and another name not in the list.
Question 1: Creating and Modifying Lists
(a) In-class:
Code
# Creating the student_info list
student_info <- list(
  name = "Alice",
  age = 20,
  subjects = c("Math", "History", "Biology")
)
# Displaying the created list
student_info
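The later solutions read student_info$grades, so the take-home part (b) of Question 1 has to be run first. A minimal sketch of that step, assuming the element names given in the question, could look like this (the list is rebuilt so the block runs on its own):

```r
student_info <- list(
  name = "Alice",
  age = 20,
  subjects = c("Math", "History", "Biology")
)

# Add the two new elements; $ and [[ ]] both work for assignment
student_info$grades <- c(90, 85, 88)
student_info[["address"]] <- "123 Main St"

# Display the updated list
names(student_info)   # "name" "age" "subjects" "grades" "address"
str(student_info)
```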
Question 2: Accessing and Analyzing List Elements
(a) In-class:
Code
# Extracting and printing the name and subjects
student_name <- student_info$name
student_subjects <- student_info$subjects
student_name
[1] "Alice"
Code
student_subjects
[1] "Math" "History" "Biology"
(b) Take-home:
Code
# Calculating the average grade
average_grade <- mean(student_info$grades)
# Counting the number of subjects
num_subjects <- length(student_info$subjects)
average_grade
[1] 87.66667
Code
num_subjects
[1] 3
Question 3: Nested Lists
(a) In-class:
Code
# Creating the nested list school_info
school_info <- list(
  school_name = "Greenwood High",
  students = list(
    student1 = student_info,
    student2 = list(name = "Bob", age = 22, subjects = c("Physics", "Math", "English"))
  )
)
# Displaying the created nested list
school_info
(b) Take-home:
Code
# Adding student3 to the students list
school_info$students$student3 <- list(name = "Charlie", age = 23, subjects = c("Chemistry", "Music", "Art"))
# Displaying the updated school_info list
school_info
Challenging Question:
Code
# Function to get the average grade of a student
get_average_grade <- function(school_info, student_name) {
  # Returns NULL if there is no entry with this name
  student_data <- school_info$students[[student_name]]
  if (!is.null(student_data) && !is.null(student_data$grades)) {
    return(mean(student_data$grades))
  } else {
    return(paste("No grades found for", student_name))
  }
}
# Testing the function
get_average_grade(school_info, "student1")
[1] 87.66667
Code
get_average_grade(school_info, "student2")
[1] "No grades found for student2"
Code
get_average_grade(school_info, "John")
[1] "No grades found for John"
3.5 Data Frame
A data frame is a table-like structure that stores data in rows and columns. Each column in a data frame can be of a different data type (e.g., numeric, character, factor), but all elements within a column must be of the same type. This makes data frames ideal for representing datasets.
3.5.1 Creating a Data Frame
Data frames can be created using the data.frame() function.
     Name Age Grade
1   Alice  20     A
2     Bob  21     B
3 Charlie  19     A
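A data.frame() call that would produce the table above is sketched here. The column values are read off the printed output, and the name `students` is assumed because Section 3.5.3 modifies a data frame with that name.

```r
# One vector per column; all vectors must have the same length
students <- data.frame(
  Name  = c("Alice", "Bob", "Charlie"),
  Age   = c(20, 21, 19),
  Grade = c("A", "B", "A")
)
students
```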
3.5.2 Accessing Data in Data Frames (Indexing)
In R, data frames are similar to tables in that they store data in rows and columns. Each row represents an observation and each column a variable. Indexing in data frames refers to accessing specific rows, columns, or cells of the data frame.
By Column Name
You can access the data of a specific column using the $ operator or double square brackets.
Code
# Creating a sample data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(20, 21, 19),
  Grade = c("A", "B", "A")
)
# Accessing the 'Name' column using the `$` operator
names <- df$Name
print(names)
[1] "Alice" "Bob" "Charlie"
Code
# Accessing the 'Age' column using double square brackets
ages <- df[["Age"]]
print(ages)
[1] 20 21 19
By Column Index
You can also access a column by its numeric index.
Code
# Accessing the first column
first_column <- df[, 1]
print(first_column)
[1] "Alice" "Bob" "Charlie"
By Row Index
You can access specific rows using their numeric indices.
Code
# Accessing the first and third rows
rows_1_and_3 <- df[c(1, 3), ]
print(rows_1_and_3)
     Name Age Grade
1   Alice  20     A
3 Charlie  19     A
By Row and Column Indices
You can access a specific cell of the data frame using its row and column indices.
Code
# Accessing the age of the second student
age_of_second <- df[2, 2]
print(age_of_second)
[1] 21
By Row and Column Names
You can also use row and column names to access specific cells. Note: By default, the row names of a data frame are simply the row numbers ("1", "2", "3", ...); meaningful row names have to be set explicitly.
Code
# Setting row names for our data frame
rownames(df) <- c("Student1", "Student2", "Student3")
# Accessing the grade of the third student using row and column names
grade_of_third <- df["Student3", "Grade"]
print(grade_of_third)
[1] "A"
3.5.3 Modifying a Data Frame
We will cover this topic in more detail in the Data Cleaning session. For now, here is an example of adding, modifying, and removing columns:
Code
# Adding a new column
students$Major <- c("Math", "Biology", "Physics")
print(students)
     Name Age Grade   Major
1   Alice  20     A    Math
2     Bob  21     B Biology
3 Charlie  19     A Physics
Code
# Modifying a column
students$Age <- students$Age + 1
print(students)
     Name Age Grade   Major
1   Alice  21     A    Math
2     Bob  22     B Biology
3 Charlie  20     A Physics
Code
# Removing a column
students$Major <- NULL
print(students)
     Name Age Grade
1   Alice  21     A
2     Bob  22     B
3 Charlie  20     A
3.5.4 Useful Functions for Data Frames
Here are some functions that are particularly useful when working with data frames:
head(df): Displays the first six rows of the data frame df.
tail(df): Displays the last six rows of the data frame df.
str(df): Provides a structured overview of the data frame df, showing the data types of each column and the first few entries of each column.
dim(df): Returns the dimensions (number of rows and columns) of the data frame df.
summary(df): Provides a statistical summary of each column in the data frame df.
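A quick way to get a feel for these functions is to run them on the small sample data frame from Section 3.5.2, rebuilt here so the block is self-contained:

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(20, 21, 19),
  Grade = c("A", "B", "A")
)

head(df, 2)   # first two rows; the default without a second argument is six
tail(df, 1)   # last row
dim(df)       # 3 3 (rows, then columns)
str(df)       # one line per column: name, type, first values
summary(df)   # numeric summary for Age, basic info for the text columns
```

Both head() and tail() accept the number of rows as a second argument, which is handy for larger data sets such as mtcars.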
3.5.5 Load and Save Data Frames in R
Handling data is one of the most essential aspects of data analysis in R. In this section, we’ll explore how to load data into R from external sources and save it for future use.
Read a CSV File
To read a CSV (Comma-Separated Values) file and store its contents as a data frame, use the read.csv() function.
Code
# Load a CSV file into a data frame
df <- read.csv("path_to_file.csv")
# Display the first few rows of the data frame
head(df)
Saving to CSV Files
To save a data frame to a CSV file, use the write.csv() function.
Code
# Save a data frame to a CSV file
write.csv(df, "path_to_output_file.csv", row.names = FALSE)
# Note: `row.names = FALSE` ensures that row names are not written to the CSV.
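To check that saving and loading work together, the sketch below writes a small data frame to a temporary file and reads it back. The file location comes from tempfile(), so nothing is left in your working directory. The names `csv_path` and `df_back` are illustrative.

```r
df <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(20, 21)
)

# Write to a temporary CSV file without row names
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
# in every R version (it is already the default from R 4.0 onwards)
df_back <- read.csv(csv_path, stringsAsFactors = FALSE)
df_back

# Clean up the temporary file
unlink(csv_path)
```

Without `row.names = FALSE`, the file would gain an extra first column holding the row names, and that column would come back as a regular column when read.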
Loading Data Frames from Excel Files
To work with Excel files, you need external packages such as readxl (for reading) and writexl (for writing). The steps are:
Install the readxl package: install.packages("readxl")
Load the library: library(readxl)
Read the data using: df_excel <- read_excel("path_to_file.xlsx")
Loading Data Frames from STATA’s .dta File
To read .dta files from Stata into R, follow these steps:
Install the haven package: install.packages("haven")
Load the library: library(haven)
Read the data using: df_dta <- read_dta("path_to_file.dta")
4 Summary
By finishing this lecture, you should be able to:
Understand the basic data types and data structures in R.
For the vector, matrix, and list data structures, you should know how to:
Access values
Perform basic operations
Know how to save and read data frames.
5 Comprehensive Challenge Project
You do not need to do this project this week, as I know you are busy.
It is meant for your own practice later and for anyone who is using these notes for self-study.
5.1 Introduction
In this challenge project, you will demonstrate your understanding of the basic data types and data structures in R and manipulate a built-in data set to gain insights. The project tests your proficiency in accessing values in vectors, matrices, lists, and data frames, performing basic operations, and saving and reading data frames.
5.2 Dataset
We will use the built-in dataset mtcars. This data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
5.3 Tasks
5.3.1 Basic Data Types and Structures
(a) Vectors
Create a numeric vector that represents the miles per gallon (mpg) of the mtcars dataset.
Calculate and print the average mpg.
Access and print the mpg of the 10th car.
(b) Matrices
Convert the first 5 rows and 3 columns of the mtcars dataset into a matrix.
Perform a matrix operation: Multiply the above matrix by 2 and print the result.
(c) Lists
Create a list that contains:
A numeric vector of horsepower (hp) of the cars.
A character vector of car model names.
Access and print the name of the 5th car from the list.
5.3.2 Data Frame Operations
Access and print the details of the car with the highest horsepower.
Save this single-row data frame to a CSV file named “high_hp_car.csv”.
Read the “high_hp_car.csv” file back into R and print its contents to confirm the saved data.
5.3.3 Comprehensive Exploration
Filter cars that have an mpg greater than 20 and fewer than 6 cylinders.
For these filtered cars, calculate:
The average horsepower.
The median weight.
The number of cars with manual transmission (am column value is 1).
5.3.4 Challenge Question!
Create a matrix of dimensions 3x3 using the mpg, hp, and wt (weight) columns for the first three cars.
Invert this matrix. (Hint: You can use the solve() function.)
Check if the matrix is singular before inversion (its determinant should not be zero).
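The determinant check and the inversion can be tried out first on a small made-up matrix. The sketch below uses an arbitrary matrix `M` with a known determinant of 18, not the mtcars data, so it leaves the actual task open. The tolerance of 1e-10 is an assumption chosen for illustration.

```r
# An arbitrary 3x3 matrix (hypothetical values, filled column by column)
M <- matrix(c(2, 1, 0,
              1, 3, 1,
              0, 1, 4), nrow = 3)

# A matrix can only be inverted if its determinant is not zero;
# compare against a small tolerance rather than exactly 0
d <- det(M)   # 18, up to floating-point rounding
if (abs(d) > 1e-10) {
  M_inv <- solve(M)
  # Multiplying a matrix by its inverse should give the identity matrix
  print(round(M %*% M_inv, 10))
} else {
  print("Matrix is singular and cannot be inverted")
}
```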
5.4 Deliverables
An R script containing all the operations performed and any auxiliary functions created.
The “high_hp_car.csv” file.
(Optional) A brief report generated with R Markdown, including output (as an HTML or PDF file).