R For Dummies. Vries Andrie de
It’s important to stress that the GPL does not pertain to your usage of R. There are no obligations for using the software – the obligations just apply to redistribution. In short, if you change and redistribute the R source code, you have to make those changes available for anybody else to use.
The R Core Team has put a lot of effort into making R available for different types of hardware and software. This means that R is available for Windows, Unix systems (such as Linux), and the Mac.
R itself is a powerful language that performs a wide variety of functions, such as data manipulation, statistical modeling, and graphics. One really big advantage of R, however, is its extensibility. Developers can easily write their own software and distribute it in the form of add-on packages. Because of the relative ease of creating and using these packages, literally thousands of packages exist. In fact, many new (and not-so-new) statistical methods are published with an R package attached.
The R user base keeps growing. Many people who use R eventually start helping new users and advocating the use of R in their workplaces and professional circles. Sometimes they also become active on
✔ The R mailing lists (http://www.r-project.org/mail.html)
✔ Question-and-answer (Q&A) websites, such as
● StackOverflow, a programming Q&A website (www.stackoverflow.com/questions/tagged/r)
● CrossValidated, a statistics Q&A website (http://stats.stackexchange.com/questions/tagged/r)
In addition to these mailing lists and Q&A websites, R users may
✔ Blog actively (www.r-bloggers.com).
✔ Participate in social networks such as Twitter (www.twitter.com/search/rstats).
✔ Attend regional and international R conferences.
See Chapter 11 for more information on R communities.
As more and more people moved to R for their analyses, they started trying to incorporate R in their previous workflows. This led to a whole set of packages for linking R to file systems, databases, and other applications. Many of these packages have since been incorporated into the base installation of R.
For example, the R package foreign (http://cran.r-project.org/web/packages/foreign/index.html) forms part of the recommended packages of R and enables you to read data from the statistical packages SPSS, SAS, Stata, and others (see Chapter 12).
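As a minimal sketch of how foreign is used, the following code writes a small data frame to a temporary file in Stata format and reads it back. The data frame scores and the temporary file are made up for illustration and are not from the book. The same package offers read.spss() for SPSS files.

```r
# Sketch: round-trip a small data frame through Stata's .dta format.
library(foreign)

# Hypothetical example data, invented for this illustration.
scores <- data.frame(id = 1:3, score = c(2.5, 3.7, 4.1))

# Write the data to a temporary Stata file, then read it back into R.
dta_file <- tempfile(fileext = ".dta")
write.dta(scores, dta_file)
scores_back <- read.dta(dta_file)

print(scores_back)
```

In real work you would point read.dta() or read.spss() at an existing file rather than one you just wrote yourself.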
Several add-on packages exist to connect R to database systems, such as
✔ RODBC, to read from databases using the Open Database Connectivity protocol (ODBC) (http://cran.r-project.org/web/packages/RODBC/index.html)
✔ ROracle, to read Oracle databases (http://cran.r-project.org/web/packages/ROracle/index.html)
Initially, most of R was based on Fortran and C. Code from these two languages could easily be called from within R. As the community grew, C++, Java, Python, and other popular programming languages became more and more connected with R.
As more data analysts started using R, the developers of commercial data software could no longer ignore the new kid on the block. Many of the big commercial packages have add-ons to connect with R. Notably, both IBM’s SPSS and SAS Institute’s SAS allow you to move data and graphics between their own environment and R, and also to call R functions directly from within these packages.
Other third-party developers also have contributed to better connectivity between different data analysis tools. For example, Statconn developed RExcel, an Excel add-on that allows users to work with R from within Excel (http://www.statconn.com/products.html).
Looking At Some of the Unique Features of R
R is more than just a domain-specific programming language aimed at data analysis. It has some unique features that make it very powerful, the most important one arguably being the notion of vectors. These vectors allow you to perform sometimes complex operations on a set of values in a single command.
R is a vector-based language. You can think of a vector as a row or column of numbers or text. The list of numbers {1,2,3,4,5}, for example, could be a vector. Unlike most other programming languages, R allows you to apply functions to the whole vector in a single operation without the need for an explicit loop.
It is time to illustrate vectors with some real R code. First, assign the values 1:5 to a vector called x:
> x <- 1:5
> x
[1] 1 2 3 4 5
Next, add the value 2 to each element in the vector x:
> x + 2
[1] 3 4 5 6 7
You can also add one vector to another. To add the values 6:10 element-wise to x, you do the following:
> x + 6:10
[1]  7  9 11 13 15
To do this in most other programming languages would require an explicit loop to run through each value of x. However, R is designed to perform many operations in a single step. This functionality is one of the features that make R so useful – and powerful – for data analysis.
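To make the contrast concrete, here is a small sketch of the same addition written both ways. The names y, loop_sum, and vector_sum are made up for illustration. The loop mimics what you would typically write in a language without vectorized arithmetic.

```r
x <- 1:5
y <- 6:10

# Explicit loop: visit each position and add the matching elements.
loop_sum <- integer(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized form: a single expression does the same work.
vector_sum <- x + y

# Both approaches give the same result.
print(identical(loop_sum, vector_sum))
```

The vectorized version is shorter and easier to read, which is why idiomatic R code rarely needs loops for element-wise arithmetic.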
We introduce the concept of vectors in Chapter 2 and expand on vectors and vectorization in much more depth in Chapter 4.
R was developed by statisticians to make statistical data analysis easier. This heritage continues, making R a very powerful tool for performing virtually any statistical computation.
As R started to expand away from its origins in statistics, many people who would describe themselves as programmers rather than statisticians have become involved with R. The result is that R is now eminently suitable for a wide variety of nonstatistical tasks, including data processing, graphical visualization, and analysis of all sorts. R is being used in the fields of finance, natural language processing, genetics, biology, and market research, to name just a few.
R is Turing complete, which means that you can use R alone to program anything you want. (Not every task is easy to program in R, though.)
In this book, we assume that you want to find out about R programming, not statistics, although we provide an introduction to statistics with R in Part IV.
R is an interpreted language, which means that – contrary to compiled languages like C and Java – you don’t need a compiler to first create a program from your code before you can use it. R interprets the code you provide directly and