7 Tips for getting started with R
Avoid common pitfalls of the R programming language and get started quickly
So, you have officially entered the world of R. Perhaps you have a background in Stata or SAS, perhaps you are proficient in Python, or perhaps you are an Excel expert. Whatever your background, it pays to know a few things about the R programming language before diving in. This article highlights seven things you should know before you start using R and helps you avoid some of the most common problems that new R users encounter.
Before diving into the tips, let’s quickly define R. R is a programming language commonly used for statistical computing and graphics. However, the large number of user-contributed packages has extended R’s applicability well beyond data analysis and visualization.
1. There is a package for that problem
When you start out, you might be tempted to cobble together your own solutions to problems that seem impossible to solve with base R. This is a bad idea for a number of reasons, starting with the fact that there is probably already a package that simplifies your solution and keeps it error-free. Get to know and use R packages, most of which are hosted on the Comprehensive R Archive Network (CRAN).
Listing 1 shows an example of a package that makes things easier for you.
Listing 1. An example R package
library(sandwich)
library(lmtest)
xs <- rnorm(1000,10,2)
ys <- xs*2
X <- cbind(1,xs)
plot(ys~xs)
ys <- ys + xs*rnorm(1000)
plot(ys~xs)
mod <- lm(ys~xs)
summary(mod)
ehat <- diag(residuals(mod)^2)
sqrt(diag(solve(t(X)%*%X) %*% t(X) %*% ehat %*% X %*% solve(t(X)%*%X)))
coeftest(mod, vcov = sandwich)
The code in Listing 1 creates the data used in the regression (lines 3-5). The data is designed so that xs and ys are correlated, but heteroscedasticity is also present. Line 7 shows where the heteroscedasticity comes from: the random component added to ys (which is already related to xs) is scaled by xs. Heteroscedasticity does not hinder your ability to estimate a regression, but it makes the usual standard errors of the coefficients unreliable. You can run the regression and view those standard errors in the output of the summary function.
To compute reliable (robust) standard errors by hand, you need to build a residual matrix and an X matrix (lines 11 and 5). You then work those matrices through the matrix formula that produces the robust standard errors (line 12). You will probably need to look the formula up, and some trial and error is inevitable. The next line (line 13) shows how the same robust standard errors can be obtained with the sandwich and lmtest packages.
These packages are loaded in lines 1 and 2. Before you can load them, you must install them with the install.packages() function. Using these packages is easier, and less error-prone, than designing your own approach. Not only can packages perform tasks that base R cannot, they have also been vetted by the open source community.
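For reference, the one-time installation step for the two packages used in Listing 1 looks like this (a minimal sketch; run the install once per machine):

install.packages(c("sandwich","lmtest"))   # download and install from CRAN
library(sandwich)                          # then load them in each session
library(lmtest)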
It helps that R packages have excellent, standardized documentation with a strong emphasis on examples. Use packages rather than reinventing solutions in base R, make sure you understand them, and use them as well as you can. Your R code will be much better for it.
2. Understand how to structure data
It is important to understand how to structure data in R. The scripts and output in this section list some of the data structures available, including vectors, matrices, data frames, and lists.
Data structure use cases
Let’s take a quick look at the use cases for each data structure type.
- Vector: Used when you need to store values of a single type for one variable, such as the weights of all the Ford F-150s in a dataset.
- Matrix: Used when you need to store multiple variables of the same type, or when you need to move or transform data in R. Many R functions, such as principal component analysis, require data to be supplied as a matrix.
- Data frame: Used to store multiple variables of different data types; data frames are great for exploring and examining data sets.
- List: Used to move data around when its type or length is uncertain. Lists are especially useful for returning results from functions.
Let’s first examine vectors, as shown in Listing 2. A vector is a collection of data of the same type. Vectors are the basic units for moving data around in R, and they are created with the c() function. They are easy to create, and individual elements can be accessed with an index that starts at 1, not 0.
Listing 2. Vectors in R
#Vectors
vex <- c(1,5,3)
vex
[1] 1 5 3
vex <- c(1 + 2i,4,5)
vex
[1] 1+2i 4+0i 5+0i
vex <- c(TRUE,1,0,1)
vex
[1] 1 1 0 1
vex <- c("This","That",1)
vex
[1] "This" "That" "1"
vex[1]
[1] "This"
This code shows that when you feed the c() function data of different types, it coerces everything to a single common type.
The same is true for matrices, as shown in Listing 3. A matrix is essentially a two-dimensional vector (R also has multi-dimensional arrays, which generalize matrices to higher dimensions). You can create a matrix from two vectors. If you specify no other options, the resulting matrix mat1 is a single-column matrix. If you specify that you want two rows in mat1 with nrow=2, the matrix becomes a 2×5 matrix.
Listing 3. Matrices in R
#Matrices
v1 <- c(0:4)
v2 <- c(5:9)
mat1 <- matrix(c(v1,v2))
mat1
      [,1]
 [1,]    0
 [2,]    1
 [3,]    2
 [4,]    3
 [5,]    4
 [6,]    5
 [7,]    6
 [8,]    7
 [9,]    8
[10,]    9
mat1 <- matrix(c(v1,v2),nrow=2)
mat1
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    4    6    8
[2,]    1    3    5    7    9
mat2 <- matrix(c(v1,v2),ncol=2)
mat1*mat2
Error in mat1 * mat2 : non-conformable arrays
mat3 <- mat1%*%mat2
mat3[1,2]
[1] 160
You will notice that creating the matrix does not simply stack the two vectors on top of each other. That is because, unless told otherwise, a matrix is filled column by column. For the same reason, if you create the same data as a two-column matrix (mat2), you cannot multiply mat1 by mat2 with the * operator: by default, multiplication and division work cell by cell, not as matrix operations. You can request true matrix multiplication with the %*% operator. After performing the matrix multiplication, you can access the cell in the first row and second column of the resulting matrix.
Let’s talk about data frames, as shown in Listing 4.
The script starts by creating a data frame with the data.frame function. You can add data without specifying variable names, but in this example the variable names are given explicitly. Notice that a data frame can store different types of data, whereas matrices and vectors store only one type.
Listing 4. Data frames in R
#Dataframes
df <- data.frame(Bool = sample(c(TRUE,FALSE),100,replace=T),Int = c(1:100),
                 String=sample(LETTERS,100,replace=TRUE))
df$Bool
  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
 [23] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
 [45] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
 [67]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
 [89]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
df[df$Bool,]
   Bool Int String
2  TRUE   2      Z
3  TRUE   3      R
5  TRUE   5      K
6  TRUE   6      T
10 TRUE  10      O
11 TRUE  11      U
13 TRUE  13      Y
14 TRUE  14      Z
16 TRUE  16      N
df[df$Bool,3]
 [1] Z R K T O U Y Z N H D L B H N L D Z R M I W W I M D A C B R S M Y Y F V B W P Q Q S M Y K Z J V I
Levels: A B C D E F H I J K L M N O P Q R S T U V W X Y Z
df$String[df$Bool]
 [1] Z R K T O U Y Z N H D L B H N L D Z R M I W W I M D A C B R S M Y Y F V B W P Q Q S M Y K Z J V I
Levels: A B C D E F H I J K L M N O P Q R S T U V W X Y Z
df$NewVar <- c(1:101)
Error in `$<-.data.frame`(`*tmp*`, "NewVar", value = 1:101) :
  replacement has 101 rows, data has 100
df$NewVar <- c(1:99)
Error in `$<-.data.frame`(`*tmp*`, "NewVar", value = 1:99) :
  replacement has 99 rows, data has 100
In this code, you can see the output when the $ operator is used to access the data frame variable Bool (line 4). This is a common and easy way to get at data stored in a data frame. Next, you can use logical values (or 1s and 0s) to select specific rows or columns. df[df$Bool,] (line 5) returns all rows where Bool is TRUE. Because no column is specified, all columns are printed.
The next call also specifies the third column, so only the values of the String variable are printed (line 6). After that, you can see another way to do the same thing: the $ operator selects the String variable, which can then be treated as a vector, and the Bool values select which of its elements are returned. Because the String variable is selected first, there is no need to specify a column when using Bool to choose the output (line 7).
Obviously, data frames are much more flexible than vectors and matrices. However, the last two lines of the script show that any new variable must have the same length as the existing ones. This requirement is not a huge burden, but lists (which we will discuss next) are far less strict about the data you add to them.
Listing 5 shows the code to create a list holding three vectors of different lengths. You can still access list elements with the $ operator, or you can use [[ ]].
Another important property of lists is that they can contain other data objects. For example, you can add a data frame to the list (line 6). The full data frame is stored and can still be accessed in the same way we discussed earlier (line 8). In addition, list elements can be named or unnamed. All of this matters when you write functions or packages: many routines produce several different kinds of data that need to be stored in different ways, and lists are useful because all of those objects can be bundled together and returned as a single result.
Listing 5. Lists in R
#Lists
ls <- list(Bool = sample(c(TRUE,FALSE),50,replace=T),Int = c(1:75),
           String=sample(LETTERS,100,replace=TRUE))
ls$String
  [1] "O" "F" "E" "U" "O" "P" "E" "V" "G" "Z" "T" "F" "T" "C" "J" "P" "G" "L" "M" "E" "O" "T" "E" "R" "A" "Z" "E" "Y" "N" "Y" "U" "N" "E"
 [34] "T" "N" "W" "Z" "D" "S" "R" "P" "C" "H" "G" "N" "Y" "P" "M" "H" "A" "J" "Y" "C" "C" "Y" "S" "P" "J" "W" "J" "H" "E" "B" "Z" "X" "T"
 [67] "B" "M" "I" "P" "V" "I" "H" "M" "D" "I" "T" "L" "J" "F" "M" "B" "J" "E" "G" "K" "E" "U" "F" "U" "T" "L" "B" "Z" "U" "X" "U" "P" "D"
[100] "W"
ls[[3]]
  [1] "O" "F" "E" "U" "O" "P" "E" "V" "G" "Z" "T" "F" "T" "C" "J" "P" "G" "L" "M" "E" "O" "T" "E" "R" "A" "Z" "E" "Y" "N" "Y" "U" "N" "E"
 [34] "T" "N" "W" "Z" "D" "S" "R" "P" "C" "H" "G" "N" "Y" "P" "M" "H" "A" "J" "Y" "C" "C" "Y" "S" "P" "J" "W" "J" "H" "E" "B" "Z" "X" "T"
 [67] "B" "M" "I" "P" "V" "I" "H" "M" "D" "I" "T" "L" "J" "F" "M" "B" "J" "E" "G" "K" "E" "U" "F" "U" "T" "L" "B" "Z" "U" "X" "U" "P" "D"
[100] "W"
ls[[4]] <- df
ls[[4]]
   Bool Int String
1 FALSE   1      Q
2  TRUE   2      Z
3  TRUE   3      R
4 FALSE   4      M
5  TRUE   5      K
6  TRUE   6      T
ls[[4]][2:5,1]
[1]  TRUE  TRUE FALSE  TRUE
names(ls)
[1] "Bool"   "Int"    "String" ""
There is a lot to learn about data in R, but understanding how to use vectors, matrices, data frames, and lists can help you get started.
3. RStudio is the master of the IDE world
RStudio is the only IDE you need when writing R scripts. It puts all the tools you need in one place, and it integrates tightly with Sweave and R Markdown. These tools let you create documents that mix code, output, figures, and text; because the documents are generated from a script, they are fully reproducible.
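As a quick illustration, here is a minimal sketch of an R Markdown document and the command that renders it; the file name report.Rmd and the document’s contents are illustrative placeholders, not taken from this article.

---
title: "A Reproducible Report"
output: html_document
---

Some explanatory text, followed by a code chunk whose code, output, and plot
are embedded in the rendered document:

```{r}
summary(cars)
plot(cars)
```

Rendering the file from the R console (using the rmarkdown package) produces the finished report:

> rmarkdown::render("report.Rmd")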
Another great RStudio tool is the R debugger, which is integrated into the IDE. Listing 6 shows the debugger in action.
Listing 6. Practical use of the debugger
testF <- function(x,y){
if(length(x) == length(y)){testF1(x,y)}
}
testF1 <- function(x,y){
print(cbind(x,y))
}
x <- c(1:10)
y <- c(11:21)
testF(x,y)
If you were hoping this would print x and y, you will be disappointed: nothing is printed. If you cannot determine the cause yourself, you can start the debugger by typing the following commands into the console (see Figure 1).
> debug(testF)
> testF(x,y)
Figure 1. Debugger
Now you can access the debugger and step through the function or go to the next line, as shown in Figure 2.
Figure 2. Step debugging a function
A Traceback pane also opens, along with a dedicated debugging pane (see Figure 3).
Figure 3. Debug pane
Admittedly, this example is trivial, but the larger point stands: RStudio really is a one-stop shop for your R scripting needs.
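If you prefer working at the console, base R exposes the same debugging hooks that the IDE uses. Here is a rough sketch of such a session with the functions from Listing 6 (the exact browser prompt can vary):

> debugonce(testF)    # debug only the next call; no undebug() needed afterward
> testF(x,y)
Browse[2]> n          # execute the next line
Browse[2]> Q          # quit the browser
> undebug(testF)      # turn off debugging that was enabled earlier with debug()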
4. Learn to use the apply family
R is great, but you will often run into situations where a for loop makes a script run far too long. Listing 7 shows an example. It works through a large dataset of 1,000,000 records, checking each one and assigning a value to one variable based on the values of two others.
Listing 7. An example of a for loop
df <- data.frame(Strings=sample(c("This","That","The Other"),1000000,replace=T),
                 Values=sample(c(0:4),1000000,replace=T),Result = rep("DA",1000000))
for(i in 1:length(df[,1])){
  if(df$Strings[i] == "This"){
    if(df$Values[i] > 2){
      df$Result[i] <- "CP"
      next
    }else{
      df$Result[i] <- "QM"
    }
  }else if(df$Strings[i] == "That"){
    df$Result[i] <- "BS"
    next
  }else if(df$Strings[i] == "The Other"){
    if(df$Values[i] == 4){
      df$Result[i] <- "FP"
      next
    }else{
      df$Result[i] <- "DT"
    }
  }
}
Note the nested if statements inside the loop. Nesting like this inside a for loop can really bog R down; depending on your hardware, this loop can take all night to run.
This is where the apply family of functions comes in handy. apply and its relatives, such as lapply and mapply, apply a function over the rows or columns of a matrix-like object, or over the elements of a list or vector. The apply functions have two important benefits, one of which is speed.
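To make those behaviors concrete, here is a quick sketch on toy data (the data is illustrative, not from the article):

m <- matrix(1:6,nrow=2)
apply(m,1,sum)                          # apply sum over rows: 9 12
apply(m,2,sum)                          # apply sum over columns: 3 7 11
lapply(list(a=1:3,b=4:6),mean)          # apply mean over list elements; returns a list
mapply(function(x,y) x + y, 1:3, 4:6)   # vectorize over multiple arguments: 5 7 9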
Listing 8 shows an example that uses apply. The function applyF wraps the logic of the previous for loop. That function is then applied to every row of the data frame, and the return values are collected in results. With a quick conversion, you can store the result, correctly formatted, back in a data frame variable. This procedure performs the same task as the for loop, but in a matter of seconds.
Listing 8. The apply function
applyF <- function(vex){
if(vex[1] == "This"){
if(vex[2] > 2){
return("CP")
}else{
return("QM")
}
}else if(vex[1] == "That"){
return("BS")
}else if(vex[1] == "The Other"){
if(vex[2] == 4){
return("FP")
}else{
return("DT")
}
}
}
results <- apply(df,1,applyF)
df$Result <- factor(unlist(results))
Of course, the difference between seconds and hours is a big one. Listing 9 shows another task that apply makes much faster.
Listing 9. Faster results
df2 <- data.frame(A=sample(as.character(c(1:100)),1000,replace=T),
                  B=sample(as.character(c(1:100)),1000,replace=T),
                  C=sample(as.character(c(1:100)),1000,replace=T),
                  D=sample(as.character(c(1:100)),1000,replace=T),
                  E=sample(as.character(c(1:100)),1000,replace=T),
                  F=sample(as.character(c(1:100)),1000,replace=T))
df2[,1:6] <- apply(df2,2,as.numeric)
Another benefit of the apply functions is that they simplify your code. In Listing 9, a data frame holds integers that are stored as characters. There are several ways you could convert each variable individually, but apply combined with as.numeric converts them all at once, performing a large-scale transformation in one short line of code. The final example is an even more common use of apply:
vars <- apply(df2,2,var)
vars
       A        B        C        D        E        F
831.8953 810.2209 806.5781 854.8382 820.8769 866.8276
If you want the variance of every variable in a data frame, you just specify the data frame, the margin (in this case 2, for columns), and the function. The output shows the variance of each column. If you want to become an effective R user, it is important to understand apply and its related functions.
5. Base graphics are great, and so is ggplot2
R’s base package creates very useful graphics, but your graphics really begin to stand out once you master ggplot2. Let’s look at some examples that compare base graphics with ggplot2.
The script in Listing 10 generates the images shown in Figures 4 and 5.
Listing 10. A base package plot versus the ggplot2 plot
library(ggplot2)
data(esoph)
barplot(xtabs(esoph$ncases~esoph$tobgp+esoph$alcgp),beside=TRUE,col=rainbow(4),
        main="Number of Cancer cases by Alcohol and Tobacco Use Groups",
        xlab="Alcohol Use Group",ylab="Cases")
legend("topright",legend=levels(esoph$tobgp),fill=rainbow(4),title="Tobacco Use Group")

ggplot(esoph,aes(x=alcgp,y=ncases,fill=tobgp))+
  geom_bar(position="dodge",stat="identity")+
  labs(fill="Tobacco Use Group",x="Alcohol Use Group",y="Cases",
       title="Number of Cancer cases by Alcohol and Tobacco Use Groups")
Figure 4. Base package results
Figure 5. ggplot2 results
Both are good. Each conveys the necessary information, but the ggplot2 version is more polished.
Let’s look at another example. The script in Listing 11 compares German and Swiss stocks, showing both the base package result and the ggplot2 result.
Listing 11. A stock comparison showing the base package result and the ggplot2 result
EUst <- EuStockMarkets
plot(EUst[,1],ylab="Euros",main="German and Swiss Stock Time Series Comparison")
lines(EUst[,2],col="Blue")
legend("topleft",legend=c("German","Swiss"),col=c("Black","Blue"),lty=1)

df <- data.frame(Year = as.double(time(EUst)),German = as.double(EUst[,1]),
                 Swiss = as.double(EUst[,2]))
ggplot(df,aes(x=Year))+
  geom_line(aes(y=df$Swiss,col="Swiss"))+
  geom_line(aes(y=df$German,col="German"))+
  labs(color="",y="Euros",title="German and Swiss Stock Time Series Comparison")
Figure 6. German and Swiss stocks using base packages
Figure 7. German and Swiss stocks using ggplot2
There is nothing wrong with base graphics, which hold their own against graphics produced in other languages, but the ggplot2 version is a little more aesthetically pleasing. In addition, creating graphs with ggplot2 takes a more intuitive approach than creating them with the base package.
The final example (Listing 12) shows how ggplot2 can display the data in a more informative way than the base package (see Figures 8 and 9).
It uses the classic and widely used iris data set. Here, k-means clustering is performed on the physical attributes of the flowers rather than on their species labels. There are three species, so three clusters are requested.
Listing 12. The power of ggplot2
iris <- iris
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}
fit <- kmeans(iris[,1:4], 3)
iris$Cluster <- fit$cluster
par(mfrow=c(1,2))
plot(iris$Sepal.Length,iris$Sepal.Width,col=iris$Cluster,pch=16,
     xlab="Sepal Length",ylab="Sepal Width",main="Iris Data Colored by Cluster")
legend("topright",legend=c(1:3),col=c(1:3),pch=16)
plot(iris$Sepal.Length,iris$Sepal.Width,col=iris$Species,pch=16,
     xlab="Sepal Length",ylab="Sepal Width",main="Iris Data Colored by Species")
legend("topright",legend=levels(iris$Species),col=c(1:3),pch=16)
par(mfrow=c(1,1))
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width))+
  geom_point(aes(colour=Species))+
  facet_grid(Cluster~.)+
  labs(x="Sepal Length",y="Sepal Width",title="Iris Species-Cluster Relationship")
The base package plot shows that the Iris setosa observations line up with a single cluster. It also shows that the clustering does not cleanly separate Iris versicolor from Iris virginica, but that is about all it tells us.
Figure 8. Iris data using the base package
The ggplot2 plot facets the data by cluster and colors each point by species. The relationship between the versicolor and virginica clusters is easier to identify, and you can also start to see why it exists: the virginica flowers assigned to the wrong cluster tend to have shorter sepals than the rest, which is a plausible explanation.
Figure 9. Iris data using ggplot2
When you are not too concerned with appearance, base graphics are great for actually analyzing data. Knowing ggplot2 helps your data stand out when you present it to the outside world.
6. R may be just the tool you want or need
R can do a lot of things that you might otherwise hand off to a less data-analysis-driven language. Listing 13 is an example of using R as a more general-purpose scripting tool.
The query is performed over HTTP. The httr package has a GET function for standard HTTP GET requests, and the Google Places API accepts HTTP requests, so a properly constructed URL initiates a query via GET. The base URL and API key are defined near the top of the script (the key is omitted here, but you can get your own from Google). Then, inside the qgoogle function, the query is built, the GET request is executed, and the results are parsed.
Listing 13. R’s powerful scripting capabilities
library('httr')
library('rjson')
library('stringr')

preamble <- 'https://maps.googleapis.com/maps/api/place/textsearch/json?'
key <- 'key=yourkeyhere'

for(i in 1:length(dataset[,1])){
  dataset[i,] <- tryCatch(qgoogle(dataset[i,]),
    error = function(err){
      print(paste("ERROR: ",err))
    })
}

qgoogle <- function(vex){
  name <- str_replace_all(vex$BUSINESS," ","+")
  line_two <- str_replace_all(vex$BUSINESS2," ","+")
  city <- str_replace_all(vex$CITY," ","+")
  addr <- str_replace_all(vex$CLEANADDRE," ","+")
  if(line_two == ""){
    query <- paste(name,addr,city,state,sep="+")
  }else{
    query <- paste(name,line_two,addr,city,state,sep="+")
  }
  url <- paste(preamble,'&',"query=",query,'&',key,sep = "")
  json.obj <- GET(url)
  content <- content(json.obj)
  if(content$status != "ZERO_RESULTS") {
    vex$DATA <- TRUE
    vex$DATA.WITH.ADDRESS <- TRUE
    vex$NAME <- content$results[[1]]$name
    vex$ADDR <- content$results[[1]]$formatted_address
    vex$LAT <- content$results[[1]]$geometry$location$lat
    vex$LONG <- content$results[[1]]$geometry$location$lng
    if(length(content$results[[1]]$types) != 0){
      vex$TYPE <- content$results[[1]]$types[[1]]
    }
    if(length(content$results[[1]]$permanently_closed) != 0){
      vex$CLOSED <- "Permanently Closed"
    }
  } else {
    vex$NAME <- NA
    vex$ADDR <- NA
    vex$LAT <- NA
    vex$LONG <- NA
    vex$TYPE <- NA
    vex$CLOSED <- NA
    vex$DATA <- FALSE
    vex$DATA.WITH.ADDRESS <- FALSE
  }
  return(vex)
}
R is not a perfect substitute for other scripting languages, but the example above shows that R can perform many of the same tasks that any other scripting language can.
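Stripped of the bookkeeping in Listing 13, the core pattern is only a few lines. The following is a minimal sketch; the query text and key are placeholders.

library(httr)

url <- paste0('https://maps.googleapis.com/maps/api/place/textsearch/json?',
              'query=coffee+in+columbus+ohio','&','key=yourkeyhere')
response <- GET(url)          # issue the HTTP GET request
result <- content(response)   # httr parses the JSON body into an R list
str(result, max.level = 1)    # inspect the top level of the response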
7. Rcpp is very good
Rcpp is a package that lets you bring C++ functions into your R scripts. Here is a standard example of a C++ function written for use in R:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int timesTwo(int x) {
return x * 2;
}
You could write this particular example directly in R, but the point is to show how easy it is to create a C++ function and then bring that function into an R environment. RStudio makes the process even easier to manage. If you need to implement a really nice piece of functionality, and C++ is the right tool for it, you can easily integrate it into any R script you want.
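Here is a minimal sketch of pulling the function above into an R session; the file name timesTwo.cpp is a placeholder for wherever you save the C++ code.

library(Rcpp)

sourceCpp("timesTwo.cpp")   # compile the file and expose timesTwo() to R
timesTwo(21)                # returns 42

# For very small functions, cppFunction() avoids the separate file entirely
cppFunction('int timesThree(int x) { return x * 3; }')
timesThree(3)               # returns 9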
Conclusion
This article has only scratched the surface of R’s capabilities and what you need to know to use it. These seven tips are important and can help you save some time and eliminate some headaches as you embark on your journey with R. Have fun developing scripts.
Related topics
- 5 Reasons you Should Learn R now
- IBM Data Science Experience
- Code for this article (GitHub)