Tuesday, September 9, 2014

Getting and Cleaning Data Quiz 1

Quiz 1 from Coursera's "Getting and Cleaning Data" course.
It is a simple quiz with short code, but I am keeping it here for the record.

Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book describing the variable names is here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf 
How many properties are worth $1,000,000 or more?
A1. 
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(fileUrl, destfile="./quiz1.csv")
list.files("./")
# Read the CSV as-is; there is no need to treat 0 as NA for this question
d <- read.csv("./quiz1.csv")
head(d)
# VAL code 24 is the top bracket ($1,000,000 or more); count the matching rows
sum(d$VAL == 24, na.rm=TRUE)
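
The question asks for a count, so the logical comparison can be summed directly instead of printing the matching values. A minimal sketch on a made-up data frame: the `toy` values are invented for illustration, and it assumes VAL code 24 is the top bracket, which is what the filter on VAL above targets.

```r
# Toy stand-in for the housing data; the values are made up for illustration.
toy <- data.frame(VAL = c(24, 3, NA, 24, 17, NA, 24))

# Comparing NA with 24 gives NA, so na.rm = TRUE drops those rows from the sum.
# Each TRUE counts as 1, so the sum is the number of matching rows.
n_million <- sum(toy$VAL == 24, na.rm = TRUE)
print(n_million)
```

On the real data the same expression would be applied to d$VAL.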

Question 3

Download the Excel spreadsheet on Natural Gas Acquisition Program here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx 

Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:
 dat 
What is the value of:
 sum(dat$Zip*dat$Ext,na.rm=T) 
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)

A3.
install.packages("xlsx")
library(xlsx)
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"
# mode="wb" keeps the binary .xlsx file intact when downloading on Windows
download.file(fileUrl, destfile="./quiz2.xlsx", mode="wb")

colIndex <- 7:15
rowIndex <- 18:23
dat <- read.xlsx("./quiz2.xlsx", sheetIndex=1, colIndex=colIndex, rowIndex=rowIndex)
sum(dat$Zip*dat$Ext, na.rm=TRUE)

Question 4

Read the XML data on Baltimore restaurants from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml 

How many restaurants have zipcode 21231?

A4.
install.packages("XML")
library(XML)
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
download.file(fileUrl, destfile="./quiz4.xml")
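# useInternalNodes=TRUE is required for the XPath query below
doc <- xmlTreeParse("./quiz4.xml", useInternalNodes=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
# Collect the text of every <zipcode> node, then count the matches
zipcode <- xpathSApply(rootNode, "//zipcode", xmlValue)
sum(zipcode == "21231")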

Question 5

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

using the fread() command load the data into an R object
 DT 
Which of the following is the fastest way to calculate the average value of the variable
pwgtp15 
broken down by sex using the data.table package?

A5. 
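fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(fileUrl, destfile="./quiz5.csv")
library(data.table)
# The question asks for fread(), which returns a data.table
DT <- fread("./quiz5.csv")
head(DT)
# Base R: split the weights by SEX, then average each group
sapply(split(DT$pwgtp15,DT$SEX),mean)
# The same grouped mean written in data.table syntax
DT[, mean(pwgtp15), by=SEX]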
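
To see what the split/sapply idiom computes, here is a small base R sketch on a made-up table. The `toy` data frame and its values are invented for illustration; only the column names SEX and pwgtp15 come from the question. It also shows tapply() as an equivalent base R form.

```r
# Toy stand-in for the person-level survey data; all values are made up.
toy <- data.frame(SEX = c(1, 2, 1, 2, 1),
                  pwgtp15 = c(10, 20, 30, 40, 80))

# split() groups the weights by SEX; sapply() then averages each group.
by_split <- sapply(split(toy$pwgtp15, toy$SEX), mean)

# tapply() does the same grouping and averaging in a single call.
by_tapply <- tapply(toy$pwgtp15, toy$SEX, mean)

print(by_split)
print(by_tapply)
```

Both calls return one mean per SEX value, named by the group.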