Read Csv and Convert First Column as Row Names in R

Reading and Writing CSV Files

Overview

Teaching: 30 min
Exercises: 0 min

Questions

  • How do I read data from a CSV file into R?

  • How do I write data to a CSV file?

Objectives

  • Read in a .csv, and explore the arguments of the csv reader.

  • Write the altered data set to a new .csv, and explore the arguments.

The most common way that scientists store data is in Excel spreadsheets. While there are R packages designed to access data from Excel spreadsheets (e.g., gdata, RODBC, XLConnect, xlsx, RExcel), users often find it easier to save their spreadsheets in comma-separated values files (CSV) and then use R's built-in functionality to read and manipulate the data. In this short lesson, we'll learn how to read data from a .csv and write to a new .csv, and explore the arguments that allow you to read and write the data correctly for your needs.

Read a .csv and Explore the Arguments

Let's start by opening a .csv file containing information on the speeds at which cars of different colors were clocked in 45 mph zones in the four-corners states (car-speeds.csv). We will use the built-in read.csv(...) function call, which reads the data in as a data frame, and assign the data frame to a variable (using <-) so that it is stored in R's memory. Then we will explore some of the basic arguments that can be supplied to the function. First, open the RStudio project containing the scripts and data you were working on in episode 'Analyzing Patient Data'.

    # Import the data and look at the first six rows
    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    35  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona
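If the lesson data is not at hand, the same import can be tried on a small stand-in file. The sketch below is only an illustration: the three rows and the names sample_path and sample_speeds are made up here and are not the contents of car-speeds.csv.

```r
# Hypothetical three-row sample, NOT the real car-speeds.csv
sample_path <- tempfile(fileext = ".csv")
writeLines(c("Color,Speed,State",
             "Blue,32,NewMexico",
             "Red,45,Arizona",
             "White,34,Arizona"),
           sample_path)

# Read the file into a data frame and inspect it
sample_speeds <- read.csv(file = sample_path)
head(sample_speeds)    # first rows (all three rows here)
dim(sample_speeds)     # number of rows, number of columns
names(sample_speeds)   # column names taken from the header row
```

Writing to tempfile() keeps the example from touching the project's data folder.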

Changing Delimiters

The default delimiter of the read.csv() function is a comma, but you can use other delimiters by supplying the 'sep' argument to the function (e.g., typing sep = ';' allows a semi-colon separated file to be correctly imported - see ?read.csv() for more information on this and other options for working with different file types).
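As a rough illustration of the sep argument, the sketch below reads the same semicolon-separated text twice, once with the default separator and once with sep = ';'. The text and the names semi_text, wrong and right are invented for this example.

```r
# Hypothetical semicolon-separated sample, supplied inline through `text`
semi_text <- "Color;Speed;State
Blue;32;NewMexico
Red;45;Arizona"

# With the default comma separator, each line ends up in one single column
wrong <- read.csv(text = semi_text)
ncol(wrong)

# Supplying sep = ';' splits the fields as intended
right <- read.csv(text = semi_text, sep = ";")
ncol(right)
right
```

Base R also provides read.csv2(), which defaults to sep = ';' together with a decimal comma, for files written in that convention.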

The read.csv() call on car-speeds.csv above will import the data, but we have not taken advantage of several handy arguments that can be helpful in loading the data in the format we want. Let's explore some of these arguments.

The default for read.csv(...) is to set the header argument to TRUE. This means that the first row of values in the .csv is set as header information (column names). If your data set does not have a header, set the header argument to FALSE:

    # The first row of the data without setting the header argument:
    carSpeeds[1, ]

      Color Speed     State
    1  Blue    32 NewMexico

    # The first row of the data if the header argument is set to FALSE:
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', header = FALSE)
    carSpeeds[1, ]

         V1    V2    V3
    1 Color Speed State

Clearly this is not the desired behavior for this data set, but it may be useful if you have a dataset without headers.
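For a file that really has no header row, header = FALSE can be combined with the col.names argument of read.csv() so that the columns get meaningful names instead of V1, V2, V3. This is a minimal sketch on made-up data; no_header_text and speeds are hypothetical names.

```r
# Hypothetical sample with no header row
no_header_text <- "Blue,32,NewMexico
Red,45,Arizona"

# header = FALSE keeps the first line as data;
# col.names supplies the column names ourselves
speeds <- read.csv(text = no_header_text,
                   header = FALSE,
                   col.names = c("Color", "Speed", "State"))
speeds
```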

The stringsAsFactors Argument

In older versions of R (prior to 4.0) this was perhaps the most important argument in read.csv(), particularly if you were working with categorical data. This is because the default behavior of R was to convert character strings into factors, which may make it difficult to do such things as replace values. It is important to be aware of this behavior, which we will demonstrate. For example, let's say we find out that the data collector was color blind, and accidentally recorded green cars as being blue. In order to correct the data set, let's replace 'Blue' with 'Green' in the $Color column:

    # Here we will use R's `ifelse` function, in which we provide the test phrase,
    # the outcome if the result of the test is 'TRUE', and the outcome if the
    # result is 'FALSE'. We will also assign the results to the Color column,
    # using '<-'

    # First - reload the data with a header
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" "1"     "Green" "5"     "4"     "Green" "Green" "2"     "5"
     [10] "4"     "4"     "5"     "Green" "Green" "2"     "4"     "Green" "Green"
     [19] "5"     "Green" "Green" "Green" "4"     "Green" "4"     "4"     "4"
     [28] "4"     "5"     "Green" "4"     "5"     "2"     "4"     "2"     "2"
     [37] "Green" "4"     "2"     "4"     "2"     "2"     "4"     "4"     "5"
     [46] "2"     "Green" "4"     "4"     "2"     "2"     "4"     "5"     "4"
     [55] "Green" "Green" "2"     "Green" "5"     "2"     "4"     "Green" "Green"
     [64] "5"     "2"     "4"     "4"     "2"     "Green" "5"     "Green" "4"
     [73] "5"     "5"     "Green" "Green" "Green" "Green" "Green" "5"     "2"
     [82] "Green" "5"     "2"     "2"     "4"     "4"     "5"     "5"     "5"
     [91] "5"     "4"     "4"     "4"     "5"     "2"     "5"     "2"     "2"
    [100] "5"

What happened?!? It looks like 'Blue' was replaced with 'Green', but every other color was turned into a number (as a character string, given the quote marks before and after). This is because the colors of the cars were loaded as factors, and the integer code of each factor level was reported following replacement.

To see the internal structure, we can use another function, str(). In this case, the data frame's internal structure includes the format of each column, which is what we are interested in. str() will be reviewed a little more in the lesson Data Types and Structures.

    # Reload the data with a header (the previous ifelse call modifies attributes)
    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = TRUE)
    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: Factor w/ 5 levels " Red","Black",..: 3 1 3 5 4 3 3 2 5 4 ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

We can see that the $Color and $State columns are factors and $Speed is a numeric column.
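The numbers that appeared in place of the colors are the integer codes R stores inside a factor. The following sketch reproduces the effect on a tiny made-up factor (colors and replaced are hypothetical names) so the mechanism can be inspected in isolation.

```r
# Small made-up factor standing in for the Color column
colors <- factor(c("Blue", "Red", "White", "Blue"))

levels(colors)       # the distinct labels, sorted alphabetically
as.integer(colors)   # the integer code stored for each element

# ifelse() takes the 'no' values from the factor's underlying integer
# codes, so the labels "Red" and "White" are lost
replaced <- ifelse(colors == "Blue", "Green", colors)
replaced
# [1] "Green" "2"     "3"     "Green"
```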

Now, let's load the dataset using stringsAsFactors=FALSE, and see what happens when we try to replace 'Blue' with 'Green' in the $Color column:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', stringsAsFactors = FALSE)
    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: chr  "NewMexico" "Arizona" "Colorado" "Arizona" ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

That's better! And we can see how the data now is read as character instead of factor. From R version 4.0 onwards we do not have to specify stringsAsFactors=FALSE; this is the default behavior.
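Because the default changed between R versions, one option is to state the argument explicitly so that a script behaves the same everywhere. A minimal sketch on made-up data (color_text, as_chr and as_fct are hypothetical names):

```r
# Hypothetical two-row sample
color_text <- "Color,Speed
Blue,32
Red,45"

# Stating the argument explicitly gives the same result on R 3.x and R 4.x
as_chr <- read.csv(text = color_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = color_text, stringsAsFactors = TRUE)

class(as_chr$Color)   # "character"
class(as_fct$Color)   # "factor"

# The running R version can be checked with getRversion()
getRversion() >= "4.0.0"
```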

The as.is Argument

This is an extension of the stringsAsFactors argument, but gives you control over individual columns. For example, if we want the colors of cars imported as strings, but we want the names of the states imported as factors, we would load the data set as:

    carSpeeds <- read.csv(file = 'data/car-speeds.csv', as.is = 1)
    # Note, the 1 applies as.is to the first column only

Now we can see that if we try to replace 'Blue' with 'Green' in the $Color column everything looks fine, while trying to replace 'Arizona' with 'Ohio' in the $State column returns the factor numbers for the names of states that we haven't replaced:

    str(carSpeeds)

    'data.frame':	100 obs. of  3 variables:
     $ Color: chr  "Blue" " Red" "Blue" "White" ...
     $ Speed: int  32 45 35 34 25 41 34 29 31 26 ...
     $ State: Factor w/ 4 levels "Arizona","Colorado",..: 3 1 2 1 1 1 3 2 1 2 ...

    carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
    carSpeeds$Color

      [1] "Green" " Red"  "Green" "White" "Red"   "Green" "Green" "Black" "White"
     [10] "Red"   "Red"   "White" "Green" "Green" "Black" "Red"   "Green" "Green"
     [19] "White" "Green" "Green" "Green" "Red"   "Green" "Red"   "Red"   "Red"
     [28] "Red"   "White" "Green" "Red"   "White" "Black" "Red"   "Black" "Black"
     [37] "Green" "Red"   "Black" "Red"   "Black" "Black" "Red"   "Red"   "White"
     [46] "Black" "Green" "Red"   "Red"   "Black" "Black" "Red"   "White" "Red"
     [55] "Green" "Green" "Black" "Green" "White" "Black" "Red"   "Green" "Green"
     [64] "White" "Black" "Red"   "Red"   "Black" "Green" "White" "Green" "Red"
     [73] "White" "White" "Green" "Green" "Green" "Green" "Green" "White" "Black"
     [82] "Green" "White" "Black" "Black" "Red"   "Red"   "White" "White" "White"
     [91] "White" "Red"   "Red"   "Red"   "White" "Black" "White" "Black" "Black"
    [100] "White"

    carSpeeds$State <- ifelse(carSpeeds$State == 'Arizona', 'Ohio', carSpeeds$State)
    carSpeeds$State

      [1] "3"    "Ohio" "2"    "Ohio" "Ohio" "Ohio" "3"    "2"    "Ohio" "2"
     [11] "4"    "4"    "4"    "4"    "4"    "3"    "Ohio" "3"    "Ohio" "4"
     [21] "4"    "4"    "3"    "2"    "2"    "3"    "2"    "4"    "2"    "4"
     [31] "3"    "2"    "2"    "4"    "2"    "2"    "3"    "Ohio" "4"    "2"
     [41] "2"    "3"    "Ohio" "4"    "Ohio" "2"    "3"    "3"    "3"    "2"
     [51] "Ohio" "4"    "4"    "Ohio" "3"    "2"    "4"    "2"    "4"    "4"
     [61] "4"    "2"    "3"    "2"    "3"    "2"    "3"    "Ohio" "3"    "4"
     [71] "4"    "2"    "Ohio" "4"    "2"    "2"    "2"    "Ohio" "3"    "Ohio"
     [81] "4"    "2"    "2"    "Ohio" "Ohio" "Ohio" "4"    "Ohio" "4"    "4"
     [91] "4"    "Ohio" "Ohio" "3"    "2"    "2"    "4"    "3"    "Ohio" "4"

We can see from the str() output that the $Color column was imported as character while $State was imported as a factor.
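A quick way to confirm which columns as.is left alone is to ask for the class of every column. The sketch below does this on a small made-up sample; cars_text and cars are hypothetical names, and the data is not taken from car-speeds.csv.

```r
# Hypothetical two-row sample
cars_text <- "Color,Speed,State
Blue,32,NewMexico
Red,45,Arizona"

# as.is = 1 keeps column 1 as character; the remaining character
# columns are converted to factors
cars <- read.csv(text = cars_text, as.is = 1)

# One class per column
sapply(cars, class)
```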

Updating Values in a Factor

Suppose we want to keep the colors of cars as factors for some other operations we want to perform. Write code for replacing 'Blue' with 'Green' in the $Color column of the cars dataset without importing the data with stringsAsFactors=FALSE.

Solution

    carSpeeds <- read.csv(file = 'data/car-speeds.csv')
    # Replace 'Blue' with 'Green' in cars$Color without using the stringsAsFactors
    # or as.is arguments
    carSpeeds$Color <- ifelse(as.character(carSpeeds$Color) == 'Blue',
                              'Green',
                              as.character(carSpeeds$Color))
    # Convert colors back to factors
    carSpeeds$Color <- as.factor(carSpeeds$Color)
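Another possible approach, not the one used in the solution above, is to rename the factor level directly with levels(), which avoids the round trip through character. A minimal sketch on a made-up factor (colors is a hypothetical name):

```r
# Small made-up factor standing in for the Color column
colors <- factor(c("Blue", "Red", "White", "Blue"))

# Rename the level 'Blue' to 'Green'; every element with that level follows
levels(colors)[levels(colors) == "Blue"] <- "Green"

colors           # still a factor, now with 'Green' instead of 'Blue'
levels(colors)
```

The integer codes are untouched by this assignment; only the label attached to one code changes.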

The strip.white Argument

It is not uncommon for mistakes to have been made when the data were recorded, for example a space (whitespace) may have been inserted before a data value. By default this whitespace will be kept in the R environment, such that ' Red' will be recognized as a different value than 'Red'. In order to avoid this type of error, use the strip.white argument. Let's see how this works by checking for the unique values in the $Color column of our dataset:

Here, the data recorder added a space before the color of the car in one of the cells:

    # We use the built-in unique() function to extract the unique colors in our dataset
    unique(carSpeeds$Color)

    [1] Green  Red  White Red   Black
    Levels:  Red Black Green Red White

Oops, we see two values for red cars.

Let's try again, this time importing the data using the strip.white argument. Note - strip.white is only used when a separator has been specified through the sep argument, which indicates the type of delimiter in the file (the comma for most .csv files). read.csv() already defaults to sep = ',', so it is spelled out here only to make this explicit.

    carSpeeds <- read.csv(
      file = 'data/car-speeds.csv',
      stringsAsFactors = FALSE,
      strip.white = TRUE,
      sep = ','
    )

    unique(carSpeeds$Color)

    [1] "Blue"  "Red"   "White" "Black"

That's better!
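If the data have already been imported without strip.white, the stray whitespace can also be removed afterwards with base R's trimws(). This is a sketch of that alternative on a made-up character vector, not a step from the lesson; colors and cleaned are hypothetical names.

```r
# Made-up values; note the leading space in " Red"
colors <- c("Blue", " Red", "White", "Red")

unique(colors)    # " Red" and "Red" are counted as different values

# trimws() removes leading and trailing whitespace from each string
cleaned <- trimws(colors)
unique(cleaned)
# [1] "Blue"  "Red"   "White"
```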

Specify Missing Data When Loading

It is common for data sets to have missing values, or mistakes. The convention for recording missing values often depends on the individual who collected the data and can be recorded as n.a., --, or empty cells " ". R recognises the reserved character string NA as a missing value, but not some of the examples above. Let's say the inflammation scale in the data set we used earlier, inflammation-01.csv, actually starts at 1 for no inflammation and the zero values (0) were a missed observation. Looking at the ?read.csv help page, is there an argument we could use to ensure all zeros (0) are read in as NA? Perhaps the car-speeds.csv data contains mistakes and the person measuring the car speeds could not accurately distinguish between "Black" or "Blue" cars. Is there a way to specify more than one 'string', such as "Black" and "Blue", to be replaced by NA?

Solution

    read.csv(file = "data/inflammation-01.csv", na.strings = "0")

or, in car-speeds.csv, use a character vector for multiple values.

    read.csv(
      file = 'data/car-speeds.csv',
      na.strings = c("Black", "Blue")
    )
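To see the effect of na.strings without the lesson files, the sketch below reads a few invented rows in which missing speeds were recorded as n.a. and --. The text and the names cars_text and cars are assumptions made for this illustration.

```r
# Hypothetical sample using two different missing-value conventions
cars_text <- "Color,Speed
Blue,32
Black,n.a.
Red,--
White,45"

# Both markers are turned into NA, so Speed can still be read as numbers
cars <- read.csv(text = cars_text, na.strings = c("n.a.", "--"))
cars$Speed
is.na(cars$Speed)
```

Without na.strings, the two markers would be kept as text and the whole Speed column would be imported as character rather than as numbers.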

Write a New .csv and Explore the Arguments

After altering our cars dataset by replacing 'Blue' with 'Green' in the $Color column, we now want to save the output. There are several arguments for the write.csv(...) function call, a few of which are particularly important for how the data are exported. Let's explore these now.

    # Export the data. The write.csv() function requires a minimum of two
    # arguments, the data to be saved and the name of the output file.
    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')

If you open the file, you'll see that it has header names, because the data had headers within R, but that there are numbers in the first column.

[Image: csv written without row.names argument]

The row.names Argument

This argument allows us to set the names of the rows in the output data file. R's default for this argument is TRUE, and since it does not know what else to name the rows for the cars data set, it resorts to using row numbers. To correct this, we can set row.names to FALSE:

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we see:

[Image: csv written with row.names argument]
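Since the screenshots are not reproduced here, the difference can be checked directly by printing the raw lines of the two files. The sketch below uses a made-up two-row data frame and temporary files; cars, with_names, without_names and restored are hypothetical names.

```r
# Made-up data frame standing in for carSpeeds
cars <- data.frame(Color = c("Blue", "Red"), Speed = c(32, 45))

with_names    <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(cars, file = with_names)                        # default row.names = TRUE
write.csv(cars, file = without_names, row.names = FALSE)

readLines(with_names)      # first field of every line is the row name
readLines(without_names)   # only the Color and Speed fields

# When reading a file that does contain row names, row.names = 1 turns
# the first column back into row names instead of a data column
restored <- read.csv(with_names, row.names = 1)
names(restored)
```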

Setting Column Names

There is also a col.names argument, which can be used to set the column names for a data set without headers. If the data set already has headers (e.g., we used the header = TRUE argument when importing the data) then a col.names argument will be ignored. Note that write.csv() itself ignores attempts to change col.names (with a warning); the argument takes effect in the more general write.table() function.

The na Argument

There are times when we want to specify certain values for NAs in the data set (e.g., we are going to pass the data to a program that only accepts -9999 as a nodata value). In this case, we want to set the NA value of our output file to the desired value, using the na argument. Let's see how this works:

    # First, replace the speed in the 3rd row with NA, by using an index (square
    # brackets to indicate the position of the value we want to replace)
    carSpeeds$Speed[3] <- NA
    head(carSpeeds)

      Color Speed     State
    1  Blue    32 NewMexico
    2   Red    45   Arizona
    3  Blue    NA  Colorado
    4 White    34   Arizona
    5   Red    25   Arizona
    6  Blue    41   Arizona

    write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv', row.names = FALSE)

Now we'll set NA to -9999 when we write the new .csv file:

    # Note - the na argument requires a string input
    write.csv(carSpeeds,
              file = 'data/car-speeds-cleaned.csv',
              row.names = FALSE,
              na = '-9999')

And we see:

[Image: csv written with -9999 as NA]
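The na argument of write.csv() and the na.strings argument of read.csv() mirror each other, which the following round trip tries to show on a made-up data frame. The names cars, out_path and back are hypothetical.

```r
# Made-up data frame with one missing speed
cars <- data.frame(Color = c("Blue", "Red", "Blue"),
                   Speed = c(32, 45, NA))

out_path <- tempfile(fileext = ".csv")

# Write NA as the sentinel value -9999
write.csv(cars, file = out_path, row.names = FALSE, na = "-9999")
readLines(out_path)

# Reading the file back with the same sentinel restores the NA
back <- read.csv(out_path, na.strings = "-9999")
back$Speed
```

If the file is read back without na.strings = "-9999", the sentinel is treated as an ordinary speed of -9999, which would distort any later summary of the column.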

Key Points

  • Import data from a .csv file using the read.csv(...) function.

  • Understand some of the key arguments available for importing the data properly, including header, stringsAsFactors, as.is, and strip.white.

  • Write data to a new .csv file using the write.csv(...) function.

  • Understand some of the key arguments available for exporting the data properly, such as row.names, col.names, and na.

eastwrible.blogspot.com

Source: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/
