Glossary for Working with Data sets

This is a companion glossary to a previous post on working with large data sets. It highlights the R functions and arguments most relevant to reading, cleaning, and using large data sets.

  1. Loading the data file:
    1. Set directory
      •  Standard form
        • setwd("…")
      • Mac
      • Windows
      •  Simplified
          • this replaces “C:/Users/UserName/Documents”
      • Additional information
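As a minimal sketch of setting the working directory, the snippet below uses tempdir() as a stand-in for a real project folder (an assumption made so that the example runs on any machine) and restores the original directory afterwards.

```r
# Remember where we started so the change can be undone.
old_dir <- getwd()

# tempdir() is only a placeholder for a real project folder.
target_dir <- tempdir()
setwd(target_dir)

# getwd() reports the directory R now reads from and writes to.
current_dir <- getwd()
print(current_dir)

# Restore the original working directory.
setwd(old_dir)
```

Checking getwd() right after setwd() is a cheap way to confirm that relative file names will resolve where you expect.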
    2. Read file
      • read table
        • read.table('file', header = TRUE, sep = " ")
          • Works with .txt and .csv files (set sep = "," for .csv)
            • ex: 
      • read csv
        • read.csv('file')
          • works like read.table, with defaults suited to .csv files (header = TRUE, sep = ",")
            • ex: 
      • read excel file
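        • read.xls('file', sheet = 'sheet')
          • similar to read.csv but for Excel files such as .xlsx
          • need to load a package that can read .xlsx files first; read.xls is provided by the gdata package, where the argument is named sheet
          • need to specify the tab (sheet) being read in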
            • ex: 
      • Choose file
        • read.csv(file.choose())
          • to choose files manually
      • Scan
        • scan('file', what = , sep = " ")
          • scan should not be used for larger data sets
      • Additional information
      • Extensive descriptions of reading data
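The reading functions above can be compared side by side on a tiny file. This is only a sketch: the file contents and the column names depth and temp are made up for illustration, and the file is written to a temporary location so nothing on disk is touched.

```r
# Write a small, made-up CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("depth,temp", "1,20.5", "2,19.8", "3,18.9"), csv_path)

# read.csv assumes a header row and comma separators.
from_csv <- read.csv(csv_path)

# read.table needs both stated explicitly to give the same result.
from_table <- read.table(csv_path, header = TRUE, sep = ",")

# scan returns a plain vector rather than a data frame, so the header
# is skipped and the value type is named through `what`.
values <- scan(csv_path, what = numeric(), sep = ",", skip = 1, quiet = TRUE)

str(from_csv)
print(values)
```

For an interactive session, read.csv(file.choose()) would replace the fixed path with a file picker; it is left out here because it cannot run unattended.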
  2. Cleaning up the data set:
    1. Blank values
      • NA values
        • na.omit(data)
          • ex: 
          • This returns a new data set identical to the original but without the rows that contain NA values
        • na.exclude(data)
          • Works like na.omit(data)
          • Both return the object with rows containing NAs removed; they differ only in how model functions such as residuals() and predict() later pad their output
        • na.fail(data)
          • Checks for NAs and returns the tested object if none are found; signals an error otherwise
        • na.pass(data)
          • Passes over NAs to return the object unchanged
      • Additional information
      • More information with examples
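The NA helpers above can be tried on a small data frame. The data below is made up for illustration, and wrapping na.fail in tryCatch is just one possible way to turn its error into a flag.

```r
# A small, made-up data frame with one missing value.
data <- data.frame(depth = c(1, 2, 3), temp = c(20.5, NA, 18.9))

# na.omit drops every row that contains an NA.
clean <- na.omit(data)

# na.pass returns the object unchanged, NAs included.
unchanged <- na.pass(data)

# na.fail stops with an error when NAs are present; tryCatch turns that
# error into a logical flag so the script keeps running.
has_na <- tryCatch({
  na.fail(data)
  FALSE
}, error = function(e) TRUE)

print(nrow(clean))
print(has_na)
```

Because na.omit removes whole rows, it is worth comparing nrow(data) with nrow(clean) to see how much of a large data set is discarded.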
    2. Converting factors and numerics
      • as.character(data)
      • as.numeric(data)
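        • converts characters into a numeric vector; applied to a factor it returns the internal level codes, so use as.numeric(as.character(data)) to recover the labelled values
        • same function as as.double (as.real is defunct in current versions of R)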
          • ex: 
        • Additional information
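The conversion functions above behave differently on factors than on character vectors, which a short sketch makes visible. The factor below is made up for illustration.

```r
# A made-up factor whose labels look like numbers.
f <- factor(c("10", "20", "10", "30"))

# Calling as.numeric directly returns the internal level codes.
codes <- as.numeric(f)

# Converting to character first recovers the labelled values.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

This matters after reading a file: a numeric column containing stray text may be stored as a factor or character vector, and converting it the wrong way silently produces level codes instead of the data.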
  3. Using the data:
    1. Plotting
      • Basic plot function
        • plot(x,y)
      • Other useful arguments inside the plot function
        • type = " "
          • plot type
          • e.g. "l" for a line plot
        • main = ' '
          • main plot title
        • xlab = ' '
          • x-axis title
        • ylab = ' '
          • y-axis title
        • col = " "
          • line color
          • e.g. "blue"
        • lwd = <num>
          • line width/thickness
      • Arguments outside plot function
        • grid()
          • turns on grid lines for plot
        • par(new = TRUE/FALSE)
          • for plotting multiple lines or data on one plot
          • TRUE if another plot call follows and should be drawn over the current plot; FALSE if not
      • ex:
      • Additional information
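The plotting arguments above can be combined as in the sketch below. The data is made up for illustration, and writing to a temporary PDF is an assumption made so the example also runs without a display; in an interactive session the pdf() and dev.off() calls can be dropped.

```r
# Made-up data: two curves over the same x values.
x <- seq(0, 10, by = 0.5)
y1 <- sin(x)
y2 <- cos(x)

# Draw to a temporary PDF so the sketch also runs without a display.
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)

# First line, with a title, axis labels, colour and line width.
plot(x, y1, type = "l", main = "Two curves", xlab = "x", ylab = "y",
     col = "blue", lwd = 2, ylim = c(-1, 1))
grid()

# par(new = TRUE) keeps the current plot so the next call draws over it.
par(new = TRUE)
plot(x, y2, type = "l", col = "red", lwd = 2, ylim = c(-1, 1),
     axes = FALSE, xlab = "", ylab = "")

# Close the device to finish writing the file.
dev.off()
```

Both calls share the same ylim so the two curves are drawn on the same scale; the second call suppresses its own axes and labels to avoid overprinting the first. Calling lines(x, y2) after the first plot is a common alternative to par(new = TRUE).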