This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
To run a function called funcname, we type “funcname(input1, input2)”, where input1 and input2 are the inputs (arguments) of the function. For instance, to create a vector of numbers, we can use the concatenate function “c()”. Here is an example of its usage.
x <- c(1,3,2,5)
x
## [1] 1 3 2 5
We can also use “=” instead of “<-” for assignment.
x = c(1,6,2)
x
## [1] 1 6 2
y = c(1,4,3)
y
## [1] 1 4 3
We can add two vectors that have the same length. To check the length of a vector, we can use the “length()” function.
length(x)
## [1] 3
length(y)
## [1] 3
x+y
## [1] 2 10 5
To list all of the objects, such as data and functions, that we have saved so far, we can use the “ls()” function. To delete any of them, we can use the “rm()” function.
ls()
## [1] "x" "y"
rm(x)
ls()
## [1] "y"
rm(y)
ls()
## character(0)
Also we can remove all objects at once:
rm(list=ls())
The help file can be used to learn about a function. Here is an example for the “matrix()” function.
?matrix
## starting httpd help server ... done
The first three inputs of the “matrix()” function are the data, the number of rows, and the number of columns. Let’s create a simple matrix.
x=matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2)
x
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
We can obtain the same matrix by simply writing:
x=matrix(c(1,2,3,4,5,6),3,2)
x
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
As these examples show, the “matrix()” function fills the matrix column by column by default. We can fill it row by row instead by setting byrow=TRUE:
matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
Note that we did not assign the matrix to x in the above command. Therefore, this matrix is printed but is not saved for future calculations.
The “sqrt()” function takes the square root of each element of a vector or a matrix:
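The two fill orders can be compared side by side. The following is a minimal sketch; the object names vals, by_col, and by_row are chosen only for illustration.

```r
# The same six values, filled column by column (default) and row by row.
vals <- c(1, 2, 3, 4, 5, 6)
by_col <- matrix(vals, nrow = 3, ncol = 2)
by_row <- matrix(vals, nrow = 3, ncol = 2, byrow = TRUE)

# Filling a 3 x 2 matrix by row gives the transpose of the
# 2 x 3 matrix that is filled by column.
same_as_transpose <- identical(by_row, t(matrix(vals, nrow = 2, ncol = 3)))

# Both results were assigned, so they stay available for later use.
print(by_col)
print(by_row)
print(same_as_transpose)
```

Because both matrices are assigned to names, they can be reused later, unlike the unassigned call above.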
sqrt(x)
## [,1] [,2]
## [1,] 1.000000 2.000000
## [2,] 1.414214 2.236068
## [3,] 1.732051 2.449490
We can square each element by using “x^2”:
x^2
## [,1] [,2]
## [1,] 1 16
## [2,] 4 25
## [3,] 9 36
To obtain a vector of random normal variables, the “rnorm()” function is used. Let’s create two vectors and compare them by using the “cor()” function, which gives the correlation between two vectors.
x = rnorm(50)
y = rnorm(50)
cor(x,y)
## [1] 0.06680277
Because “rnorm()” draws new random numbers on every call, x and y are different vectors, and the correlation above will change each time the code is run.
By default, the “rnorm()” function creates standard normal random variables, with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments:
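To make clearer what “cor()” measures, the sketch below recomputes the correlation by hand as the covariance divided by the product of the two standard deviations. The names r_manual and r_builtin are only illustrative, and the printed correlation will differ from run to run.

```r
# Two independent standard normal samples.
x <- rnorm(50)
y <- rnorm(50)

# Pearson correlation: covariance scaled by both standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# The two values agree up to floating-point rounding.
print(all.equal(r_manual, r_builtin))
```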
z=x+rnorm(50, mean=50, sd=.5)
mean(z)
## [1] 49.98524
sd(z)
## [1] 1.393942
Sometimes we want our code to reproduce the exact same set of random numbers; we can use the “set.seed()” function to do this. The function takes an arbitrary integer as its argument.
set.seed(100)
rnorm(50)
## [1] -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127 0.31863009
## [7] -0.58179068 0.71453271 -0.82525943 -0.35986213 0.08988614 0.09627446
## [13] -0.20163395 0.73984050 0.12337950 -0.02931671 -0.38885425 0.51085626
## [19] -0.91381419 2.31029682 -0.43808998 0.76406062 0.26196129 0.77340460
## [25] -0.81437912 -0.43845057 -0.72022155 0.23094453 -1.15772946 0.24707599
## [31] -0.09111356 1.75737562 -0.13792961 -0.11119350 -0.69001432 -0.22179423
## [37] 0.18290768 0.41732329 1.06540233 0.97020202 -0.10162924 1.40320349
## [43] -1.77677563 0.62286739 -0.52228335 1.32223096 -0.36344033 1.31906574
## [49] 0.04377907 -1.87865588
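A short sketch of what the seed does: resetting the same seed before a call reproduces the same numbers, while a call without a reset simply continues the random stream. The names first, second, and third are only illustrative.

```r
set.seed(100)
first <- rnorm(5)

set.seed(100)       # reset to the same seed
second <- rnorm(5)  # repeats the numbers in first

third <- rnorm(5)   # no reset: continues the stream, so it differs

print(identical(first, second))
print(identical(first, third))
```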
We can obtain the mean, standard deviation, and variance of a vector by using the “mean()”, “sd()”, and “var()” functions. The square root of the variance equals the standard deviation:
set.seed(10)
y=rnorm(1000)
mean(y)
## [1] 0.01137474
sd(y)
## [1] 0.9918421
var(y)
## [1] 0.9837508
sqrt(var(y))
## [1] 0.9918421
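As a check on the relationship above, the sample variance can be recomputed from its definition, the sum of squared deviations from the mean divided by n - 1. This is a minimal sketch; var_manual and sd_manual are illustrative names.

```r
set.seed(10)
y <- rnorm(1000)
n <- length(y)

# Sample variance from its definition (note the n - 1 denominator).
var_manual <- sum((y - mean(y))^2) / (n - 1)
sd_manual <- sqrt(var_manual)

# Both agree with the built-in functions up to rounding.
print(all.equal(var_manual, var(y)))
print(all.equal(sd_manual, sd(y)))
```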
The “plot()” function can be used for plotting the data in R. Here you can see an example:
x=rnorm(100)
y=rnorm(100)
plot(x,y)
We can add axis labels and a title to the graph:
plot(x,y,xlab="x value",ylab="y value",main="Plot of X vs Y")
We can save a plot to a file. To create a PDF or JPEG file, we use the “pdf()” or “jpeg()” function.
pdf("Figure.pdf")
plot(x,y,col="green")
dev.off()
## png
## 2
The function “dev.off()” indicates to R that we are done creating the plot.
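The sketch below shows the full open, draw, close cycle and then checks that the file was actually written. It assumes a temporary file is an acceptable destination; out_file is a hypothetical path created with “tempfile()”.

```r
# Hypothetical output path in the session's temporary directory.
out_file <- tempfile(fileext = ".pdf")

x <- rnorm(100)
y <- rnorm(100)

pdf(out_file)              # open the PDF device
plot(x, y, col = "green")  # draw into the file instead of the screen
invisible(dev.off())       # close the device so the file is completed

print(file.exists(out_file))
print(file.size(out_file) > 0)
```

Until “dev.off()” is called, the file may be incomplete, which is why the check comes after closing the device.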
To create a sequence of numbers, the “seq()” function can be used. For example, “seq(a,b,length=c)” creates a vector of c equally spaced numbers from a to b. For whole numbers, “a:b” is a shorthand that gives the integers from a to b.
x=seq(0,1,length=11)
x
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
x=0:10
x
## [1] 0 1 2 3 4 5 6 7 8 9 10
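A sequence can be specified either by its length or by its step size. The sketch below compares the two; note that length in the call above is a partial match for the full argument name length.out. The names by_length and by_step are only illustrative.

```r
by_length <- seq(0, 1, length.out = 11)  # 11 equally spaced points
by_step <- seq(0, 1, by = 0.1)           # same points, given by step size

# The two specifications agree up to floating-point rounding.
print(all.equal(by_length, by_step))

# The gap between neighbouring elements is constant.
print(all.equal(diff(by_length), rep(0.1, 10)))

# a:b is shorthand for a sequence of whole numbers.
print(identical(0:10, seq(0, 10)))
```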
We can also produce more sophisticated graphs. The “contour()” function creates a contour plot to represent three-dimensional data. It takes three arguments: a vector of the x values, a vector of the y values, and a matrix whose elements correspond to the z value for each pair of (x,y) coordinates.
x=seq(-pi,pi,length=50)
y=x
f=outer(x,y,function(x,y)cos(y)/(1+x^2))
contour(x,y,f)
contour(x,y,f,nlevels=45,add=T)
fa=(f-t(f))/2
contour(x,y,fa,nlevels=15)
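The z matrix above is built with “outer()”, which evaluates a function on every (x,y) pair. The sketch below uses a small grid and a hypothetical toy function g to show how the entries line up, and then checks that a matrix of the form (f - t(f))/2 is antisymmetric.

```r
x <- c(1, 2, 3)
y <- c(10, 20)
g <- function(x, y) x + y  # hypothetical toy function

# One row per x value, one column per y value.
z <- outer(x, y, g)
print(dim(z))
print(z[3, 2] == g(x[3], y[2]))

# On a square grid, (f - t(f)) / 2 equals minus its own transpose.
s <- seq(-pi, pi, length.out = 5)
f <- outer(s, s, function(x, y) cos(y) / (1 + x^2))
fa <- (f - t(f)) / 2
print(all.equal(fa, -t(fa)))
```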
The “image()” function works the same way as “contour()”, except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap. Alternatively, the “persp()” function can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.
image(x,y,fa)
persp(x,y,fa)
persp(x,y,fa,theta=30)
persp(x,y,fa,theta=30,phi=20)
persp(x,y,fa,theta=30,phi=70)
persp(x,y,fa,theta=30,phi=40)
We can access any element of a matrix. Here is an example.
A=matrix(1:16,4,4)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
Let’s select a single element.
A[3,4]
## [1] 15
As this example shows, the first index refers to the row and the second to the column. We selected the element in the 3rd row and 4th column, which is 15. We can also select submatrices.
A[c(2,4),c(1,3)]
## [,1] [,2]
## [1,] 2 10
## [2,] 4 12
A[2:4,1:3]
## [,1] [,2] [,3]
## [1,] 2 6 10
## [2,] 3 7 11
## [3,] 4 8 12
A[1:3,]
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
A[,1:3]
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
A[2,]
## [1] 2 6 10 14
We can keep all rows or columns except the undesired ones by using a negative sign in the index.
A[-c(1,3),]
## [,1] [,2] [,3] [,4]
## [1,] 2 6 10 14
## [2,] 4 8 12 16
The size of a matrix can be obtained by using the “dim()” function.
dim(A)
## [1] 4 4
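The indexing rules above can be checked against each other with “dim()”. The sketch below is only illustrative; it also shows that selecting a single row returns a plain vector unless drop = FALSE is given.

```r
A <- matrix(1:16, 4, 4)

# Dropping rows 1 and 3 is the same as keeping rows 2 and 4.
drop_rows <- A[-c(1, 3), ]
keep_rows <- A[c(2, 4), ]
print(identical(drop_rows, keep_rows))

# Negative indices work for rows and columns at the same time.
print(dim(A[-c(1, 3), -c(1, 2)]))

# A single row loses its matrix shape unless drop = FALSE is used.
print(is.null(dim(A[2, ])))
print(dim(A[2, , drop = FALSE]))
```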
We can import a data set into R by using the “read.table()” function. To export data, the “write.table()” function can be used. Some data sets are already defined in R. Here the Auto data set is used as an example. The “fix()” function can be used to view the data set in a spreadsheet-like window.
Before running the next lines, download the “Auto.data” file from https://trevorhastie.github.io/ISLR/data.html to your working directory.
library(ISLR)
Auto=read.table("Auto.data")
fix(Auto)
The missing observations are shown with a question mark “?”. Using the option header=T in the “read.table()” function tells R that the first line of the file contains the variable names, and using the option na.strings tells R that any time it sees a particular character or set of characters (such as question mark), it should be treated as a missing element of the data matrix.
Auto=read.table("Auto.data",header=T,na.strings="?")
fix(Auto)
dim(Auto)
## [1] 397 9
Auto[1:4,]
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
There are 397 observations and 9 variables in this data set. Five of the rows contain missing observations, so we can remove these rows by using the “na.omit()” function.
Auto=na.omit(Auto)
dim(Auto)
## [1] 392 9
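The effect of header, na.strings, and “na.omit()” can be tried without the downloaded file. The sketch below reads a tiny made-up table from a text string (the rows and the names car_a, car_b, car_c are hypothetical, not taken from Auto.data) in which “?” marks a missing value.

```r
# Hypothetical three-row table; "?" marks a missing value.
raw <- "mpg horsepower name
18 130 car_a
25 ? car_b
16 150 car_c"

# Without na.strings, the "?" keeps horsepower from being read as numeric.
no_na <- read.table(text = raw, header = TRUE)
print(is.numeric(no_na$horsepower))

# With na.strings, the "?" becomes NA and the column is numeric.
toy <- read.table(text = raw, header = TRUE, na.strings = "?")
print(is.numeric(toy$horsepower))
print(sum(is.na(toy$horsepower)))

# na.omit() drops every row that contains at least one NA.
clean <- na.omit(toy)
print(dim(clean))
```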
We can obtain the names of the variables by using the “names()” function.
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
We can use the “plot()” function, as mentioned before, to plot two variables of the data set against each other.
plot(Auto$cylinders, Auto$mpg)
Alternatively, we can use the “attach()” function to tell R to make the variables in this data frame available by name.
plot(Auto$cylinders, Auto$mpg)
attach(Auto)
plot(cylinders, mpg)
The cylinders variable is stored as a numeric vector, so R has treated it as quantitative. The “as.factor()” function converts quantitative variables into qualitative variables.
cylinders=as.factor(cylinders)
Because cylinders is now a factor, “plot()” draws boxplots instead of a scatterplot. We can customize the plot.
plot(cylinders, mpg)
plot(cylinders, mpg, col="red")
plot(cylinders, mpg, col="red", varwidth=T)
plot(cylinders, mpg, col="red", varwidth=T, horizontal=T)
plot(cylinders, mpg, col="red", varwidth=T, xlab="cylinders", ylab="MPG")
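To see what “as.factor()” does to a variable, here is a minimal sketch on a made-up vector of cylinder counts (the values are hypothetical, not taken from Auto).

```r
cyl <- c(8, 8, 4, 6, 4)  # hypothetical cylinder counts
cyl_f <- as.factor(cyl)

print(is.numeric(cyl))   # the original vector is quantitative
print(is.factor(cyl_f))  # the converted one is qualitative

# Each distinct value becomes a level (a category label).
print(levels(cyl_f))

# table() counts how many observations fall in each level.
print(table(cyl_f))
```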
The “hist()” function can be used to plot a histogram. Note that col=2 has the same effect as col="red".
hist(mpg)
hist(mpg,col=2)
hist(mpg,col=2,breaks=15)
The “pairs()” function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables in a given data set. However, if the data set contains a character variable (such as name in Auto), it gives the error “non-numeric argument to ‘pairs’”. We can also produce scatterplots for just a subset of the variables.
pairs(~mpg + displacement + horsepower + weight + acceleration, Auto)
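Another way to avoid the error is to keep only the numeric columns before calling “pairs()”. The sketch below assumes a small illustrative data frame named toy, built from the mpg and weight values of the first four rows shown earlier plus a made-up character column.

```r
# Small illustrative data frame with one character column.
toy <- data.frame(
  mpg = c(18, 15, 18, 16),
  weight = c(3504, 3693, 3436, 3433),
  name = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# TRUE for each column that holds numbers.
numeric_cols <- sapply(toy, is.numeric)
print(names(toy)[numeric_cols])

# Scatterplot matrix of the numeric columns only.
pairs(toy[, numeric_cols])
```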
The “identify()” function is used to interactively identify the value of a particular variable for points on a plot. There are three arguments: the x-axis variable, the y-axis variable, and the variable whose values we would like to see printed for each point. When no points are selected, as in the non-interactive output shown here, it returns integer(0).
plot(horsepower,mpg)
identify(horsepower,mpg,cylinders)
## integer(0)
The “summary()” function creates a numerical summary of each variable in a particular data set.
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
We can also produce a summary of just a single variable.
summary(mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 17.00 22.75 23.45 29.00 46.60
We can save a record of all of the commands that we typed in the most recent session by using the “savehistory()” function. The next time we start R, we can load that history by using the “loadhistory()” function.