R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Basic Commands

To run a function called funcname, we type “funcname(input1, input2)”, where input1 and input2 are the inputs. For instance, to create a vector of numbers, we can use the c() function, which combines (concatenates) its arguments into a vector. Here is an example:

x <- c(1,3,2,5)
x
## [1] 1 3 2 5
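As a small aside, numbers typed without a suffix are stored as doubles, so a vector built this way is numeric rather than integer. The sketch below illustrates the difference; the variable names `v_double` and `v_integer` are illustrative and not part of the original text.

```r
# Plain numeric literals produce a double (numeric) vector.
v_double <- c(1, 3, 2, 5)
typeof(v_double)    # "double"

# The L suffix requests integer storage explicitly.
v_integer <- c(1L, 3L, 2L, 5L)
typeof(v_integer)   # "integer"

# Both vectors hold the same values.
all(v_double == v_integer)   # TRUE
```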

We can also use “=” instead of “<-” for assignment.

x = c(1,6,2)
x
## [1] 1 6 2
y = c(1,4,3)
y
## [1] 1 4 3

We can add two vectors element by element when they have the same length. To check the length of a vector, we can use the “length()” function.

length(x)
## [1] 3
length(y)
## [1] 3
x+y
## [1]  2 10  5
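It may help to see what happens when the lengths differ: R recycles the shorter vector instead of stopping. The sketch below shows this behavior, together with a small guard function; `same_length_sum` is a hypothetical helper introduced only for illustration.

```r
# When lengths differ, the shorter vector is recycled.
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b   # 11 22 13 24

# Hypothetical helper: refuse to add vectors of different lengths.
same_length_sum <- function(u, v) {
  stopifnot(length(u) == length(v))
  u + v
}

same_length_sum(c(1, 6, 2), c(1, 4, 3))   # 2 10 5
```

Recycling is silent when the longer length is a multiple of the shorter one, so an explicit length check can be useful when equal lengths are intended.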

To list all of the objects in the workspace, such as data and functions, we can use the “ls()” function. To delete any of them, we can use the “rm()” function.

ls()
## [1] "x" "y"
rm(x)
ls()
## [1] "y"
rm(y)
ls()
## character(0)
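To experiment with “ls()” and “rm()” without touching the objects in your own workspace, one option is to work inside a separate environment. This is only a sketch; the environment name `scratch` is an assumption made for illustration.

```r
# Create a separate environment so the global workspace is untouched.
scratch <- new.env()
assign("x", c(1, 6, 2), envir = scratch)
assign("y", c(1, 4, 3), envir = scratch)

ls(scratch)                # "x" "y"

# Remove a single object from that environment.
rm("x", envir = scratch)
ls(scratch)                # "y"

# Remove everything that is left in it.
rm(list = ls(scratch), envir = scratch)
length(ls(scratch))        # 0
```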

Also we can remove all objects at once:

rm(list=ls())

The help file can be used to learn about a function. For example, the following command opens the help page for the “matrix()” function.

?matrix
## starting httpd help server ... done

The first three inputs of the “matrix()” function are the data, the number of rows, and the number of columns. Let’s create a simple matrix.

x=matrix(data=c(1,2,3,4,5,6), nrow=3, ncol=2)
x
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

We can obtain the same matrix by simply writing:

x=matrix(c(1,2,3,4,5,6),3,2)
x
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

As these examples show, the “matrix()” function fills the matrix column by column by default. We can fill it row by row instead by using the byrow=TRUE option:

matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

Note that we did not assign the matrix to x in the above command. Therefore, the matrix is printed but not saved for future calculations.
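The two fill orders can be compared side by side. The sketch below also checks that filling a 3 x 2 matrix by row gives the same result as transposing a 2 x 3 matrix filled by column; the names `values`, `by_col`, and `by_row` are illustrative.

```r
values <- c(1, 2, 3, 4, 5, 6)

# Default: fill column by column.
by_col <- matrix(values, nrow = 3, ncol = 2)
# Alternative: fill row by row.
by_row <- matrix(values, nrow = 3, ncol = 2, byrow = TRUE)

by_col[1, ]   # 1 4
by_row[1, ]   # 1 2

# Row-wise filling matches the transpose of a column-filled 2 x 3 matrix.
all(by_row == t(matrix(values, nrow = 2, ncol = 3)))   # TRUE
```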

Let’s continue with the “sqrt()” function, which takes the square root of each element of a vector or a matrix:

sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 2.000000
## [2,] 1.414214 2.236068
## [3,] 1.732051 2.449490

We can square each element by using “x^2”:

x^2
##      [,1] [,2]
## [1,]    1   16
## [2,]    4   25
## [3,]    9   36

To obtain a vector of random normal variables, the “rnorm()” function is used. Let’s create two vectors and compare them by using the “cor()” function, which gives the correlation between two vectors.

x = rnorm(50)
y = rnorm(50)
cor(x,y)
## [1] 0.06680277

Each time we call the “rnorm()” function we get a different vector, so the correlation reported above will change from run to run.

By default, the “rnorm()” function creates standard normal random variables, with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments:

z=x+rnorm(50, mean=50, sd=.5)
mean(z)
## [1] 49.98524
sd(z)
## [1] 1.393942

Sometimes we want our code to reproduce the exact same set of random numbers; we can use the “set.seed()” function to do this. The input of this function is an arbitrary integer argument.

set.seed(100)
rnorm(50)
##  [1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127  0.31863009
##  [7] -0.58179068  0.71453271 -0.82525943 -0.35986213  0.08988614  0.09627446
## [13] -0.20163395  0.73984050  0.12337950 -0.02931671 -0.38885425  0.51085626
## [19] -0.91381419  2.31029682 -0.43808998  0.76406062  0.26196129  0.77340460
## [25] -0.81437912 -0.43845057 -0.72022155  0.23094453 -1.15772946  0.24707599
## [31] -0.09111356  1.75737562 -0.13792961 -0.11119350 -0.69001432 -0.22179423
## [37]  0.18290768  0.41732329  1.06540233  0.97020202 -0.10162924  1.40320349
## [43] -1.77677563  0.62286739 -0.52228335  1.32223096 -0.36344033  1.31906574
## [49]  0.04377907 -1.87865588
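A quick way to see what “set.seed()” buys us is to draw twice from the same seed and compare the results. The sketch below does this with short vectors; the variable names are illustrative.

```r
# Two draws from the same seed are identical.
set.seed(100)
first_draw <- rnorm(5)
set.seed(100)
second_draw <- rnorm(5)
identical(first_draw, second_draw)   # TRUE

# Without resetting the seed, the random stream simply continues.
third_draw <- rnorm(5)
identical(first_draw, third_draw)    # FALSE
```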

We can obtain the mean, standard deviation, and variance of a vector by using the “mean()”, “sd()”, and “var()” functions. The square root of the variance equals the standard deviation:

set.seed(10)
y=rnorm(1000)
mean(y)
## [1] 0.01137474
sd(y)
## [1] 0.9918421
var(y)
## [1] 0.9837508
sqrt(var(y))
## [1] 0.9918421
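To make the relationship concrete, the sample variance can be computed by hand and compared with the built-in functions. This is a sketch that uses the usual n - 1 denominator, which is what “var()” uses; the names `s` and `var_manual` are illustrative.

```r
set.seed(10)
s <- rnorm(1000)
n <- length(s)

# Sample variance: squared deviations from the mean, divided by n - 1.
var_manual <- sum((s - mean(s))^2) / (n - 1)

all.equal(var_manual, var(s))         # TRUE
all.equal(sqrt(var_manual), sd(s))    # TRUE
```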

Graphics

The “plot()” function can be used to plot data in R. Here is an example:

x=rnorm(100)
y=rnorm(100)
plot(x,y)

We can add axis labels and a title to the graph:

plot(x,y,xlab="x value",ylab="y value",main="Plot of X vs Y")

We can also save a plot to a file. To create a pdf or jpeg file, we open a graphics device with the “pdf()” or “jpeg()” function before plotting.

pdf("Figure.pdf")
plot(x,y,col="green")
dev.off()
## png 
##   2

The function “dev.off()” indicates to R that we are done creating the plot.
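The open, plot, close pattern can be tried without cluttering the working directory by writing to a temporary file. In this sketch the file location comes from “tempfile()” and the plotted data are made up for illustration.

```r
# Write to a temporary file instead of the working directory.
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                          # open the pdf device
plot(1:10, (1:10)^2, col = "green")    # draw into the file
dev.off()                              # close the device, finishing the file

file.exists(out_file)                  # TRUE
```

Until “dev.off()” is called, the file is still open and may be incomplete.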

To create a sequence of numbers, the “seq()” function can be used. For example, “seq(a,b,length=c)” creates a vector of c equally spaced numbers between a and b. For integers, “a:b” is a shorthand for “seq(a,b)”.

x=seq(0,1,length=11)
x
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
x=0:10
x
##  [1]  0  1  2  3  4  5  6  7  8  9 10
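A sequence can be specified either by its length or by its step size. The sketch below compares the two; note that `length` in the call above is matched to the full argument name `length.out`. The variable names are illustrative.

```r
# Specify the number of points ...
by_length <- seq(0, 1, length.out = 11)
# ... or the spacing between them.
by_step <- seq(0, 1, by = 0.1)

all.equal(by_length, by_step)   # TRUE

# The colon operator yields consecutive integers.
length(0:10)    # 11
typeof(0:10)    # "integer"
```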

We can also produce more complex graphs. The “contour()” function creates a contour plot to represent three-dimensional data. It takes three arguments: a vector of the x values, a vector of the y values, and a matrix whose elements correspond to the z value for each pair of (x,y) coordinates.

x=seq(-pi,pi,length=50)
y=x
f=outer(x,y,function(x,y)cos(y)/(1+x^2))
contour(x,y,f)
contour(x,y,f,nlevels=45,add=T)

fa=(f-t(f))/2
contour(x,y,fa,nlevels=15)
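The role of “outer()” is easier to see on a tiny grid: it evaluates a function for every (x, y) pair and returns the results as a matrix. The sketch below also checks that a matrix built as (f - t(f))/2 is antisymmetric. All names here (`gx`, `gy`, `g`, `grid`, `surf`, `anti`) are illustrative.

```r
# outer() applies the function to every combination of gx and gy.
gx <- 1:3
gy <- 1:2
g <- outer(gx, gy, function(a, b) 10 * a + b)
g        # rows: 11 12 / 21 22 / 31 32
dim(g)   # 3 2

# Same construction as in the text, on a 50-point grid.
grid <- seq(-pi, pi, length.out = 50)
surf <- outer(grid, grid, function(x, y) cos(y) / (1 + x^2))
anti <- (surf - t(surf)) / 2

# An antisymmetric matrix satisfies anti == -t(anti).
max(abs(anti + t(anti)))   # 0, up to rounding
```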

The “image()” function works the same way as “contour()”, except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap. Alternatively, the “persp()” function can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.

image(x,y,fa)

persp(x,y,fa)

persp(x,y,fa,theta=30)

persp(x,y,fa,theta=30,phi=20)

persp(x,y,fa,theta=30,phi=70)

persp(x,y,fa,theta=30,phi=40)

Indexing Data

We can access any element of a matrix. Here is an example.

A=matrix(1:16,4,4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

Let’s select a single element.

A[3,4]
## [1] 15

As this example shows, the first index corresponds to the row and the second index corresponds to the column. We accessed the element in the 3rd row and 4th column, which is 15. We can also extract submatrices.

A[c(2,4),c(1,3)]
##      [,1] [,2]
## [1,]    2   10
## [2,]    4   12
A[2:4,1:3]
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    3    7   11
## [3,]    4    8   12
A[1:3,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
A[,1:3]
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
A[2,]
## [1]  2  6 10 14
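Note that selecting a single row returns a plain vector rather than a one-row matrix. If the matrix structure should be kept, the `drop = FALSE` argument of the indexing operator can be used, as in this sketch; the names `row_vec` and `row_mat` are illustrative.

```r
A <- matrix(1:16, 4, 4)

# Selecting one row drops the matrix structure by default.
row_vec <- A[2, ]
is.matrix(row_vec)   # FALSE
length(row_vec)      # 4

# drop = FALSE keeps the result as a 1 x 4 matrix.
row_mat <- A[2, , drop = FALSE]
dim(row_mat)         # 1 4
```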

We can exclude rows or columns by using a negative sign in the index.

A[-c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16

The size of a matrix, that is, its number of rows followed by its number of columns, can be obtained by using the “dim()” function.

dim(A)
## [1] 4 4

Loading Data

We can import a data set into R by using the “read.table()” function. To export data, the “write.table()” function can be used. Some data sets are already available in R. The Auto data set is used here as an example. The “fix()” function can be used to view the data set in a spreadsheet-like window.

Before running the next lines, download the “Auto.data” file from https://trevorhastie.github.io/ISLR/data.html to your working directory.

library(ISLR)
Auto=read.table("Auto.data")
fix(Auto)

The missing observations are shown with a question mark “?”. Using the option header=T in the “read.table()” function tells R that the first line of the file contains the variable names, and using the option na.strings tells R that any time it sees a particular character or set of characters (such as question mark), it should be treated as a missing element of the data matrix.

Auto=read.table("Auto.data",header=T,na.strings="?")
fix(Auto)
dim(Auto)
## [1] 397   9
Auto[1:4,]
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst

There are 397 observations and 9 variables in this data set. Five of the rows contain missing observations, so we can remove these rows by using the “na.omit()” function.

Auto=na.omit(Auto)
dim(Auto)
## [1] 392   9
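The effect of header, na.strings, and “na.omit()” can be reproduced without downloading any file by reading from an in-memory string. The data below are made up purely for illustration and are not taken from Auto.data; the names `raw_text`, `toy`, and `toy_clean` are likewise assumptions.

```r
# Made-up toy data in the same layout: a header line and "?" for missing.
raw_text <- "mpg horsepower name
18 130 car_a
25 ? car_b
16 150 car_c"

# header = TRUE uses the first line as variable names;
# na.strings = "?" turns question marks into NA.
toy <- read.table(text = raw_text, header = TRUE, na.strings = "?")

is.na(toy$horsepower)    # FALSE TRUE FALSE
class(toy$horsepower)    # "integer"

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)
dim(toy_clean)           # 2 3
```

Because the question mark is read as NA, the horsepower column stays numeric instead of being read as text.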

We can obtain the names of the variables by using the “names()” function.

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"

Additional Graphical and Numerical Summaries

As mentioned before, we can use the “plot()” function to plot two variables of the data set against each other.

plot(Auto$cylinders, Auto$mpg)

Alternatively, we can use the “attach()” function to tell R to make the variables in this data frame available by name.

attach(Auto)
plot(cylinders, mpg)

The cylinders variable is stored as a numeric vector, so R treats it as quantitative. The “as.factor()” function converts quantitative variables into qualitative variables.

cylinders=as.factor(cylinders)
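The conversion can be inspected on a short made-up vector. In this sketch, `cyl_num` and `cyl_fac` are illustrative names and the values are not taken from the Auto data.

```r
# A numeric vector is treated as quantitative.
cyl_num <- c(4, 8, 6, 4, 8)
class(cyl_num)     # "numeric"

# After conversion, each distinct value becomes a category (level).
cyl_fac <- as.factor(cyl_num)
class(cyl_fac)     # "factor"
levels(cyl_fac)    # "4" "6" "8"

# table() counts the observations in each level.
table(cyl_fac)     # 4: 2, 6: 1, 8: 2
```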

Because cylinders is now a factor, the “plot()” function produces boxplots, which we can customize.

plot(cylinders, mpg)

plot(cylinders, mpg, col="red")

plot(cylinders, mpg, col="red", varwidth=T)

plot(cylinders, mpg, col="red", varwidth=T, horizontal=T)

plot(cylinders, mpg, col="red", varwidth=T, xlab="cylinders", ylab="MPG")

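The “hist()” function can be used to plot a histogram. Note that col=2 selects the second color of the current palette, which is red by default (a slightly different shade of red in R 4.0.0 and later), so it has much the same effect as col="red".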

hist(mpg)

hist(mpg,col=2)

hist(mpg,col=2,breaks=15)
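The breaks argument is treated as a suggestion, so the number of bins actually drawn may differ from the number requested. One way to check is to ask “hist()” for its computed values without plotting. This sketch uses the built-in mtcars data as a stand-in for Auto; `mpg_values` and `h` are illustrative names.

```r
# mtcars is a built-in data set, used here only as a stand-in for Auto.
mpg_values <- mtcars$mpg

# plot = FALSE returns the histogram object without drawing it.
h <- hist(mpg_values, breaks = 15, plot = FALSE)

length(h$breaks) - 1                   # number of bins actually used
sum(h$counts) == length(mpg_values)    # TRUE: every value falls in a bin
```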

The “pairs()” function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables in a given data set. However, if the data set contains any character variables (such as name here), it gives the error “non-numeric argument to ‘pairs’”. We can also produce scatterplots for just a subset of the variables.

pairs(~mpg + displacement + horsepower + weight + acceleration, Auto)
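Another way to avoid that error is to keep only the numeric columns before calling “pairs()”. The sketch below does this on a small made-up data frame; `toy_df` and `numeric_cols` are illustrative names and the values are invented.

```r
# Made-up data frame with one character column.
toy_df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8),
  label = c("w", "x", "y", "z"),
  stringsAsFactors = FALSE
)

# Flag the columns that are numeric.
numeric_cols <- sapply(toy_df, is.numeric)
names(toy_df)[numeric_cols]    # "a" "b"

# Plot only those columns.
pairs(toy_df[, numeric_cols])
```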

The “identify()” function is used to identify the value of a particular variable for points on a plot. It takes three arguments: the x-axis variable, the y-axis variable, and the variable whose values we would like to see printed for each point. When the document is knitted non-interactively, no points can be clicked, so the function returns integer(0).

plot(horsepower,mpg)
identify(horsepower,mpg,cylinders)

## integer(0)

The “summary()” function creates a numerical summary of each variable in a particular data set.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000

We can also produce a summary of just a single variable.

summary(mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60

We can save the record of all of the commands that we typed in the most recent session by using the “savehistory()” function. Next time we enter R, we can load that history by using the “loadhistory()” function.