In this tutorial, we will explore various options for loading and saving data into R from plain-text files, R files, and excel spreadsheets. It will also demonstrate the R packages that you can use to load data from databases and other common programs, like SAS and MATLAB.
R comes with many data sets pre-loaded in the datasets package, which comes with base R. One can list the pre-loaded datasets as well as the short descriptions of each by running:
> help(package=”dataset name”)
> setwd(“/location or directory path/”) #This is to set the working directory at some specified location.
> getwd()
you can place the data files straight into the folder that is the working directory, or you can move your working directory to where your data files are. You can move your working directory to any folder on your computer with the function setwd. Just give setwd the file path to your new working directory.
If the file path does not begin with your root directory, R will assume that it begins at your current working directory.
You can also change your working directory by clicking on Session > Set Working Directory > Choose Directory in the RStudio menu bar. The Windows and Mac GUIs have similar options. If you start R from a UNIX command line (as on Linux machines), the working directory will be whichever directory you were in when you called R.
You can see what files are in your working directory with list.files(). If you see the file that you would like to open in your working directory, then you are ready to proceed. How you open files in your working directory will depend on which type of file you would like to open.
Plain-text files are one of the most common ways to save data. They are very simple and can be read by many different computer programs—even the most basic text editors. For this reason, public data often comes as plain-text files.
You can use load.table() functionalities to read/load a plain-test file.
The following table shares a quick overview of the R’s read functions in which you can always overwrite any of the default arguments as necessary for your data type:
Functions | Defaults | Use |
---|---|---|
read.table | sep = " “, header = FALSE | General read function |
read.csv | sep = “,”, header = TRUE | To read the csv files |
read.delim | sep = “\t”, header = TRUE | To read the tab-delimited file |
read.csv2 | sep = “;”, header = TRUE,dec = “,” | To read CSV file with European decimal format |
read.delim2 | sep = “\t”, header = TRUE,dec = “,” | To read the tab-delimited files with European decimal format. |
There is one specific type of plain-text file that defies the pattern by using its layout to separate data cells. Each row is placed in its own line (as with other plain-text files), and then each column begins at a specific number of characters from the left-hand side of the document. To achieve this, an arbitrary number of character spaces is added to the end of each entry to correctly position the next entry. These documents are known as fixed-width files and usually, end with the extension “.fwf”. R has its own function to read such file – read.fwf.
There are many data files which are available on the Internet at their own web address. If you are connected to the Internet, you can easily open these files straight into R with read.table, read.csv, etc. You need to pass a web address into the file name argument for any of R’s data-reading functions. As a result, you could read in the iris data set from a web address like http://.../poker.csv with:
poker <- read.csv("http://.../iris.csv")
Once your data is in R, you can save it to any file format that R supports. If you’d like to save it as a plain-text file, you can use the ‘write’ family of functions.
R saves data sets to plain-text files with the write family of functions:
File Format | Function and Syntax |
---|---|
.csv | write.csv(r_object, file = filepath, row.names = FALSE) |
.csv (with european decimal notation) | write.csv2(r_object, file = filepath, row.names = FALSE) |
tab delimited | write.table(r_object, file = filepath, sep = "\t", row.names=FALSE) |
The first argument of each function is the R object that contains your data set. The file argument is the file name (including extension) that you wish to give the saved data. By default, each function will save your data into your working directory.
Sometimes it becomes quite important to save a big output file and share with others. Compression is an effective way to deal in such a scenario.
To compress a plain-text file, surround the file name or file path with the function bzfile, gzfile, or xzfile. For example:
> write.csv(poker, file = bzfile("data/poker.csv.bz2"), row.names = FALSE)
There are 2 more functions that will also compress the file with two different types of compression format.
Function | Compression Type |
---|---|
bzfile | bzip2 |
gzfile | gnu zip(gzip) |
xzfile | xz compression |
R provides two file formats of its own for storing data, .RDS and .RData. RDS files can store a single R object, and RData files can store multiple R objects.
You can open an RDS file with readRDS. For example, if the iris data was saved as iris.RDS, you could open it with:
poker <- readRDS("iris.RDS")
Opening RData files is even easier. Simply run the function load with the file:
load("file.RData")
There’s no need to assign the output to an object. The R objects in your RData file will be loaded into your R session with their original names. RData files can contain multiple R objects, so loading one may read in multiple objects. load doesn’t tell you how many objects it is reading in, nor what their names are, so it pays to know a little about the RData file before you load it.
To save data as an RData object, one can use the save function. To save data as an RDS object, use the saveRDS function. In each case, the first argument should be the name of the R object you wish to save. You should then include a file argument that has the file name or file path you want to save the data set to.
For example, if you have three R objects, a, b, and c, you could save them all in the same RData file and then reload them in another R session:
a<-5 b<-90 c<-100 save(a, b, c, file = "test.RData") load("test.RData")
There are quite a few numbers of packages to help you read Excel files directly into R. Unfortunately, many of these packages do not work on all operating systems. Others have been made out of date by the .xlsx file format. One package that does work on all file systems (and gets good reviews) is the XLConnect package. To use it, you’ll need to install and load the package:
install.packages("XLConnect") library(XLConnect)
XLConnect relies on Java to be platform-independent. So when you first open XLConnect, RStudio may ask to download a Java Runtime Environment if you do not already have one.
A good number of functions are available in R, which attempt to read the file types of other data- analysis programs:
File Format | Function | Library |
---|---|---|
MATLAB | readMat | R.matlab |
minitab | read.mtp | foreign |
SAS | read.ssd | foreign |
SAS (XPORT format) | read.xport | foreign |
SPSS | read.spss | Foreign |
Stata | read.dta | Foreign |
Systat | read.systat | foreign |
ERSI ArcGIS | read.shapefile | shapefiles |
One great aspect of R is that you can also use R to connect to a database and read in data. How you do this will depend on the database management system that you use. Working with a database will usually require experience that goes beyond the skill set of a typical R user. However, if you are interested in doing this, the best place to start is by downloading these R packages and reading their documentation.
Use the RODBC package to connect to databases through an ODBC connection.
Use the DBI package to connect to databases through individual drivers. The DBI package provides a common syntax for working with different databases. You will have to download a database-specific package to use in conjunction with DBI. These packages provide the API for the native drivers of different database programs. For MySQL use RMySQL, for SQLite use RSQLite, for Oracle use ROracle, for PostgreSQL use RPostgreSQL, and for databases that use drivers based on the Java Database Connectivity (JDBC) API use RJDBC. Once you have loaded the appropriate driver package, you can use the commands provided by DBI to access your database.
Hope you enjoyed this tutorial and in our next tutorial, we will discuss how to generate visualization using R.
Thanks for this info.
C# is an object-oriented programming developed by Microsoft that uses ...
Leave a Reply
Your email address will not be published. Required fields are marked *