Read Csv and Convert First Column as Row Names in R
Reading and Writing CSV Files
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I read data from a CSV file into R?
How do I write data to a CSV file?
Objectives
Read in a .csv, and explore the arguments of the csv reader.
Write the altered data set to a new .csv, and explore the arguments.
The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.
Read a .csv and Explore the Arguments
Let's start by opening a .csv file containing information on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (CarSpeeds.csv). We will use the built-in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.
# Import the data and look at the first six rows
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
head(carSpeeds)
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    35  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona
Changing Delimiters
The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
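As a minimal sketch of the sep argument, the example below reads a few semicolon-separated records. The records are hypothetical and are passed inline through the text argument of read.csv(), so no file on disk is assumed:

```r
# Hypothetical semicolon-separated records, supplied inline (not the lesson's file)
semicolon_text <- paste("Color;Speed;State",
                        "Blue;32;NewMexico",
                        "Red;45;Arizona",
                        sep = "\n")

# With the default comma delimiter, every line is read as one single field
wrong <- read.csv(text = semicolon_text)
ncol(wrong)    # 1

# Supplying sep = ';' splits each line into the three intended columns
right <- read.csv(text = semicolon_text, sep = ";")
ncol(right)    # 3
names(right)   # "Color" "Speed" "State"
```

A column count of 1 after import is often the first sign that the delimiter was not the one R expected.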
The call above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.
The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:
# The first row of the data without setting the header argument:
carSpeeds[1, ]
  Color Speed     State
1  Blue    32 NewMexico
# The first row of the data if the header argument is set to FALSE:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
carSpeeds[1, ]
     V1    V2    V3
1 Color Speed State
Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
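For a file that really has no header row, read.csv() also accepts a col.names argument at import time. The sketch below assumes two hypothetical headerless records supplied inline; the column names are chosen here for illustration:

```r
# Hypothetical headerless records, supplied inline (not the lesson's file)
headerless_text <- paste("Blue,32,NewMexico",
                         "Red,45,Arizona",
                         sep = "\n")

# Without names, R falls back to V1, V2, V3
default_names <- read.csv(text = headerless_text, header = FALSE)
names(default_names)   # "V1" "V2" "V3"

# col.names supplies the names while keeping every line as data
named <- read.csv(text = headerless_text, header = FALSE,
                  col.names = c("Color", "Speed", "State"))
names(named)           # "Color" "Speed" "State"
nrow(named)            # 2
```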
The stringsAsFactors Argument
In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behaviour, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:
# Here we will use R's `ifelse` function, in which we provide the test phrase,
# the outcome if the result of the test is 'TRUE', and the outcome if the
# result is 'FALSE'. We will also assign the results to the Color column,
# using '<-'

# First - reload the data with a header
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
  [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
 [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
 [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
 [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
 [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
 [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
 [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
 [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
 [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
 [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
 [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
[100] "5"
What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the factor level was reported following replacement.
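The numbers come from the way a factor is stored: a vector of integer codes plus a table of labels (the levels). The following minimal sketch uses a small hypothetical factor, not the lesson's data, to show the codes that ifelse() picks up:

```r
# A small hypothetical factor standing in for carSpeeds$Color
colors <- factor(c("Blue", "Red", "White", "Blue"))

levels(colors)       # the labels, sorted alphabetically: "Blue" "Red" "White"
as.integer(colors)   # the integer codes stored underneath: 1 2 3 1

# ifelse() fills in the 'no' values from the integer codes rather than the
# labels, so the colors that were not replaced come back as numbers
replaced <- ifelse(colors == "Blue", "Green", colors)
replaced             # "Green" "2" "3" "Green"
```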
To see the internal structure, we can use another function, str(). In this case, the dataframe's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.
# Reload the data with a header (the previous ifelse call modifies attributes)
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

str(carSpeeds)
'data.frame':	100 obs. of  3 variables:
 $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...
We can see that the $Color and $State columns are factors and $Speed is a numeric column.
Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
str(carSpeeds)
'data.frame':	100 obs. of  3 variables:
 $ Color: chr  "Blue" " Red" "Blue" "White" ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
  [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
 [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
 [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
 [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
 [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
 [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
 [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
 [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
 [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
 [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
 [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
[100] "White"
That's better! And we can see how the data is now read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE; this is the default behavior.
The as.is Argument
This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:
carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)
# Note, the 1 applies as.is to the first column only
Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:
str(carSpeeds)
'data.frame':	100 obs. of  3 variables:
 $ Color: chr  "Blue" " Red" "Blue" "White" ...
 $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
 $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
carSpeeds$Color
  [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
 [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
 [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
 [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
 [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
 [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
 [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
 [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
 [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
 [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
 [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
[100] "White"
carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
carSpeeds$State
  [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
 [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
 [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
 [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
 [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
 [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
 [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
 [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
 [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
 [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"
We can see that the $Color column is a character while $State is a factor.
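A different route to the same per-column control, not covered in this lesson, is the colClasses argument of read.csv(), which states the type of every column explicitly. The sketch below is only an illustration under assumed inputs: three hypothetical records supplied inline rather than the lesson's file.

```r
# Hypothetical records, supplied inline (not the lesson's file)
cars_text <- paste("Color,Speed,State",
                   "Blue,32,NewMexico",
                   "Red,45,Arizona",
                   "Blue,35,Colorado",
                   sep = "\n")

# One class per column, in column order
typed <- read.csv(text = cars_text,
                  colClasses = c("character", "integer", "factor"))

class(typed$Color)   # "character"
class(typed$Speed)   # "integer"
class(typed$State)   # "factor"
```

Spelling out the classes takes more typing than as.is, but the result does not depend on the R version's default for stringsAsFactors.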
Updating Values in a Factor
Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.
Solution
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
# Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
# or as.is arguments
carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                          'Green',
                          as.character(carSpeeds$Color))
# Convert colors back to factors
carSpeeds$Color <- as.factor(carSpeeds$Color)
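Another way to reach the same result, shown here as an alternative sketch rather than the lesson's solution, is to rename the level itself with levels(). The factor below is a small hypothetical stand-in for carSpeeds$Color:

```r
# A small hypothetical factor standing in for carSpeeds$Color
colors <- factor(c("Blue", "Red", "White", "Blue"))

# Rename the level; every entry that carried it follows automatically
levels(colors)[levels(colors) == "Blue"] <- "Green"

levels(colors)         # "Green" "Red" "White"
as.character(colors)   # "Green" "Red" "White" "Green"
is.factor(colors)      # TRUE
```

Because only the label table changes, the column never stops being a factor and no conversion back is needed.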
The strip.white Argument
It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:
Here, the data recorder added a space before the color of the car in one of the cells:
# We use the built-in unique() function to extract the unique colors in our dataset
unique(carSpeeds$Color)
[1] Green  Red  White Red   Black
Levels:  Red Black Green Red White
Oops, we see two values for red cars.
Let's try again, this time importing the data using the strip.white argument. Note - this argument must be accompanied by the sep argument, by which we indicate the type of delimiter in the file (the comma for most .csv files)
carSpeeds <- read.csv(file = 'data/car-speeds.csv',
                      stringsAsFactors = FALSE,
                      strip.white = TRUE,
                      sep = ',')

unique(carSpeeds$Color)
[1] "Blue"  "Red"   "White" "Black"
That's better!
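The same effect can be reproduced on a tiny scale. The sketch below assumes four hypothetical records supplied inline, one of which has a stray leading space; it also uses trimws(), a base R function not mentioned in the lesson, for data that are already loaded:

```r
# Hypothetical records; the second data line has a stray leading space
messy_text <- paste("Color,Speed",
                    "Blue,32",
                    " Red,45",
                    "Red,25",
                    sep = "\n")

# Default import keeps the space, so ' Red' and 'Red' are distinct values
kept <- read.csv(text = messy_text, stringsAsFactors = FALSE)
unique(kept$Color)           # "Blue" " Red" "Red"

# strip.white = TRUE removes leading and trailing whitespace during import
stripped <- read.csv(text = messy_text, stringsAsFactors = FALSE,
                     strip.white = TRUE, sep = ",")
unique(stripped$Color)       # "Blue" "Red"

# For a column that is already in memory, trimws() does the same clean-up
unique(trimws(kept$Color))   # "Blue" "Red"
```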
Specify Missing Data When Loading
It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes and the person measuring the car speeds could not accurately distinguish between "Black" or "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?
Solution
read.csv(file = "data/inflammation-01.csv", na.strings = "0")
or, in car-speeds.csv, use a character vector for multiple values.
read.csv(file = 'data/car-speeds.csv', na.strings = c("Black", "Blue"))
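The na.strings argument also handles the recording conventions listed above. The following is a minimal sketch with hypothetical records supplied inline, in which n.a. and -- stand for missing values:

```r
# Hypothetical records using two different missing-value conventions
survey_text <- paste("Color,Speed",
                     "Blue,32",
                     "n.a.,45",
                     "Red,--",
                     "Black,25",
                     sep = "\n")

# By default '--' is ordinary text, so Speed cannot be read as a number
default_read <- read.csv(text = survey_text)
is.numeric(default_read$Speed)   # FALSE

# Listing every convention in na.strings turns them all into NA
cleaned <- read.csv(text = survey_text, na.strings = c("n.a.", "--", "NA"))
is.numeric(cleaned$Speed)        # TRUE
is.na(cleaned$Color)             # FALSE TRUE FALSE FALSE
sum(is.na(cleaned))              # 2
```

Note that "NA" is kept in the vector on purpose: supplying na.strings replaces the default, so it has to be listed if it should still count as missing.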
Write a New .csv and Explore the Arguments
After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color column, we now want to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the data are exported. Let's explore these now.
# Export the data. The write.csv() function requires a minimum of two
# arguments, the data to be saved and the name of the output file.

write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')
If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.
The row.names Argument
This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)
Now we see:
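The effect of row.names can also be checked from within R, without opening the output in another program. The sketch below is a minimal illustration that assumes a small hypothetical data frame (cars) and temporary files rather than the lesson's data:

```r
# A small hypothetical data frame standing in for carSpeeds
cars <- data.frame(Color = c("Green", "Red"), Speed = c(32L, 45L))

# Temporary files, so nothing in the project directory is overwritten
with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(cars, file = with_rows)                        # default: row.names = TRUE
write.csv(cars, file = without_rows, row.names = FALSE)

# Print the raw lines of each file
writeLines(readLines(with_rows))
# "","Color","Speed"
# "1","Green",32
# "2","Red",45

writeLines(readLines(without_rows))
# "Color","Speed"
# "Green",32
# "Red",45
```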
Setting Column Names
There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored.
The na Argument
There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:
# First, replace the speed in the 3rd row with NA, by using an index (square
# brackets to indicate the position of the value we want to replace)
carSpeeds$Speed[3] <- NA
head(carSpeeds)
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    NA  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)
Now we'll set NA to -9999 when we write the new .csv file:
# Note - the na argument requires a string input
write.csv(carSpeeds,
          file = 'data/car-speeds-cleaned.csv',
          row.names = FALSE,
          na = '-9999')
And we see:
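The written file can be inspected directly from R. As a hedged sketch, the example below uses a small hypothetical data frame (cars) and a temporary file instead of the lesson's data, and then reads the file back with na.strings to recover the missing value:

```r
# A small hypothetical data frame with one missing speed
cars <- data.frame(Color = c("Blue", "Red", "Blue"),
                   Speed = c(32L, 45L, NA))

out_file <- tempfile(fileext = ".csv")
write.csv(cars, file = out_file, row.names = FALSE, na = "-9999")

# The missing speed is written as -9999
writeLines(readLines(out_file))
# "Color","Speed"
# "Blue",32
# "Red",45
# "Blue",-9999

# Reading the file back with a matching na.strings restores the NA
round_trip <- read.csv(out_file, na.strings = "-9999")
is.na(round_trip$Speed)   # FALSE FALSE TRUE
```

Pairing na on export with na.strings on import keeps the placeholder from being mistaken for a real measurement later on.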
Key Points
Import data from a .csv file using the read.csv(...) function.
Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.
Write data to a new .csv file using the write.csv(...) function.
Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.
Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/