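After a month of on-and-off study, this R course is nearly over. In Week 4 the assignments became noticeably harder for a beginner like me. Fortunately, after a few days of trial and error, I managed to solve the whole thing.
This week's task, Assignment 3, works with a file from the US Department of Health and Human Services called "outcome-of-care-measures.csv". It stores mortality figures for several common conditions at more than 4,000 hospitals across the 50 US states, specifically 30-day mortality and readmission rates for heart attacks, heart failure, and pneumonia. Our job is to rank hospitals within a state or nationwide by mortality rate for a given condition, so as to identify the best hospital, the worst hospital, and the hospital ranked Nth.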
Finding the best hospital in a state
Write a function called best that takes two arguments: the 2-character abbreviated name of a state and an outcome name. The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the best (i.e. lowest) 30-day mortality for the specified outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of "heart attack", "heart failure", or "pneumonia". Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.
Handling ties. If there is a tie for the best hospital for a given outcome, then the hospital names should be sorted in alphabetical order and the first hospital in that set should be chosen (i.e. if hospitals "b", "c", and "f" are tied for best, then hospital "b" should be returned).
The function should use the following template.
best <- function(state, outcome) {
## Read outcome data
## Check that state and outcome are valid
## Return hospital name in that state with lowest 30-day death
## rate
}
The function should check the validity of its arguments. If an invalid state value is passed to best, the
function should throw an error via the stop function with the exact message "invalid state". If an invalid
outcome value is passed to best, the function should throw an error via the stop function with the exact
message "invalid outcome".
Here is some sample output from the function.
> source("best.R")
> best("TX", "heart attack")
[1] "CYPRESS FAIRBANKS MEDICAL CENTER"
> best("TX", "heart failure")
[1] "FORT DUNCAN MEDICAL CENTER"
> best("MD", "heart attack")
[1] "JOHNS HOPKINS HOSPITAL, THE"
> best("MD", "pneumonia")
[1] "GREATER BALTIMORE MEDICAL CENTER"
> best("BB", "heart attack")
Error in best("BB", "heart attack") : invalid state
> best("NY", "hert attack")
Error in best("NY", "hert attack") : invalid outcome
>
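The first function, best, takes a state and a condition and returns the name of the best hospital in that state for that condition. As the assignment explains, "best" means the lowest 30-day mortality for that condition. If several hospitals are tied for the lowest rate, they are sorted alphabetically by name, and the first one is returned.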
best <- function(state, outcome) {
    data <- read.csv("outcome-of-care-measures.csv")

    ## Test whether state occurs in data$State. If not, the sum of A is 0.
    A <- data$State == state
    if (!sum(A)) {
        stop("invalid state")
    }

    ## Test whether outcome is one of the known diseases.
    disease_list <- c("heart attack", "heart failure", "pneumonia")
    if (!outcome %in% disease_list) {
        stop("invalid outcome")
    }

    ## Create the sub-data.frame for the specific state.
    StateData <- subset(data, State == state)

    ## Extract the hospital and rate columns from the data.
    if (outcome == "heart attack") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
    } else if (outcome == "heart failure") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
    } else {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
    }

    ## Assign common column names to StateData whatever the disease is.
    colnames(StateData) <- c("Hospital.Name", "Disease.Rate")

    ## Convert the Disease.Rate column from factor to numeric, and the
    ## Hospital.Name column from factor to character, so that ordering works.
    StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
    StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])

    ## Order the data.frame by disease rate, then by hospital name.
    StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

    best <- StateData[1, "Hospital.Name"]
    best
}
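First, the data file is read into data. Next we check whether state and outcome are valid. For state I use A <- data$State == state to build a logical vector. If state is not a state in the file, A is FALSE, FALSE, …, FALSE and sum(A) is 0; if it is, A contains one or more TRUE values and the sum is nonzero. I picked this trick up from the course slides, and I expect it to be useful again.
The same approach would work for outcome, but thanks to the bloggers on cnblogs I learned another trick: R's simplest way to test whether an element a is in a vector or list A is a %in% A, which returns a logical value.
An if statement on each of these two results then decides whether to call stop.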
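The two validity checks can be tried on a toy vector without loading the CSV file. This is only a sketch: states, is_valid_state_sum, is_valid_state_in, and check_state are illustrative names that do not appear in the assignment.

```r
# Toy stand-in for the State column; the real values come from the CSV file.
states <- c("TX", "TX", "MD", "NY")

# Approach 1: build a logical vector and sum it (each TRUE counts as 1).
is_valid_state_sum <- function(state) sum(states == state) > 0

# Approach 2: membership test with %in%.
is_valid_state_in <- function(state) state %in% states

is_valid_state_sum("MD")   # TRUE
is_valid_state_in("BB")    # FALSE

# stop() raises an error; tryCatch() lets us capture its message.
check_state <- function(state) {
  if (!state %in% states) {
    stop("invalid state")
  }
  invisible(TRUE)
}
msg <- tryCatch(check_state("BB"), error = function(e) conditionMessage(e))
msg                        # "invalid state"
```

Both checks give the same answer here; any(states == state) would be a third way to write the first one.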
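Next, subset extracts the rows for the given state. Then, depending on outcome, the column for that condition is located, and StateData <- StateData[c("Hospital.Name", "Hospital.30.Day…")] keeps only two columns: the hospital name and the mortality rate.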
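As an alternative to the if/else chain, the mapping from outcome to column name could live in a named character vector. The sketch below is not part of my submitted solution; outcome_cols and rate_column are hypothetical names, and the column names are the ones used in the code above.

```r
# Named character vector: outcome name -> column name in the CSV file.
outcome_cols <- c(
  "heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
  "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
  "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
)

# Hypothetical helper: return the column name, or stop on an unknown outcome.
rate_column <- function(outcome) {
  if (!outcome %in% names(outcome_cols)) {
    stop("invalid outcome")
  }
  unname(outcome_cols[outcome])
}

rate_column("pneumonia")
```

With this helper, the validity check and the column lookup share one list of outcomes, so the two cannot drift apart.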
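The core of the task is sorting. R offers sort, rank, and order. sort(x) sorts the vector x and returns the sorted values. rank(x) returns the rank of each element of the vector. order(x) returns, for each position in the sorted result, the index of the corresponding element in the original vector.
A short R session gives a feel for it: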
> w <- c(97, 93, 85, 85, 32, NA, 99)
> w
[1] 97 93 85 85 32 NA 99
> order(w)
[1] 5 3 4 2 1 7 6
As the output shows, order() puts NA last by default, as if it were the largest value.
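To compare the three functions side by side on the same vector, here is a small sketch. The na.last arguments are passed explicitly so the treatment of NA does not depend on defaults; the variable names are mine.

```r
w <- c(97, 93, 85, 85, 32, NA, 99)

# sort() returns the values in increasing order; NA is dropped by default.
sorted <- sort(w)
sorted                 # 32 85 85 93 97 99

# rank() returns each element's rank; tied values share the average rank.
# na.last = "keep" leaves NA as NA instead of ranking it.
ranks <- rank(w, na.last = "keep")
ranks                  # 5.0 4.0 2.5 2.5 1.0 NA 6.0

# order() returns the indices that would sort the vector; NA goes last.
ord <- order(w)
ord                    # 5 3 4 2 1 7 6

# na.last = NA removes the NA position from the result entirely.
ord_no_na <- order(w, na.last = NA)
ord_no_na              # 5 3 4 2 1 7
```

Indexing with the result of order reproduces the sorted vector: w[ord_no_na] gives the same values as sort(w).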
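Since order() returns the positions of the elements in sorted order, the pattern A[order(A$a), ] sorts the data frame A by column a. order() also accepts several keys, as in A[order(A$a, A$b, …), ]: rows are sorted by a first, and b breaks ties. For example: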
> x <- data.frame(foo = 1:8, State = c('TX','TX','TX','NY','NY','NY','CA','CA'), Country = c('a','a','b','e','e','f','m','n'), Site = c(1,6,1,1,3,1,8,5))
> x
foo State Country Site
1 1 TX a 1
2 2 TX a 6
3 3 TX b 1
4 4 NY e 1
5 5 NY e 3
6 6 NY f 1
7 7 CA m 8
8 8 CA n 5
> x[order(x$Site),]
foo State Country Site
1 1 TX a 1
3 3 TX b 1
4 4 NY e 1
6 6 NY f 1
5 5 NY e 3
8 8 CA n 5
2 2 TX a 6
7 7 CA m 8
> x[order(x$Site, x$Country),]
foo State Country Site
1 1 TX a 1
3 3 TX b 1
4 4 NY e 1
6 6 NY f 1
5 5 NY e 3
8 8 CA n 5
2 2 TX a 6
7 7 CA m 8
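In the session above the rows with equal Site values already happen to be in alphabetical Country order, so both calls print the same result. A toy case where the second key changes the outcome may make the tie-breaking easier to see; the hospital names and rates are invented for illustration.

```r
hospitals <- data.frame(
  name = c("F HOSPITAL", "B HOSPITAL", "C HOSPITAL", "A HOSPITAL"),
  rate = c(10.1, 10.1, 10.1, 12.0),
  stringsAsFactors = FALSE
)

# One key: tied rows keep their original relative order.
by_rate <- hospitals[order(hospitals$rate), ]
by_rate$name            # "F HOSPITAL" "B HOSPITAL" "C HOSPITAL" "A HOSPITAL"

# Two keys: ties on rate are broken alphabetically by name.
by_rate_then_name <- hospitals[order(hospitals$rate, hospitals$name), ]
by_rate_then_name$name  # "B HOSPITAL" "C HOSPITAL" "F HOSPITAL" "A HOSPITAL"
```

This is the situation from the assignment's tie rule: three hospitals share the lowest rate, and the alphabetically first one ends up in row 1.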
So, to sort by the Disease.Rate column while ordering tied hospitals alphabetically, one option is the following call to order:
StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]
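But there is a catch: sorting this way directly gives the wrong result. For example, with the state MD and the condition pneumonia, running the code up to the sort leaves StateData as follows.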
StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]
> StateData
Hospital.Name Disease.Rate
1897 CALVERT MEMORIAL HOSPITAL 10.1
1902 HOWARD COUNTY GENERAL HOSPITAL 10.1
1875 JOHNS HOPKINS HOSPITAL, THE 10.2
1906 LAUREL REGIONAL MEDICAL CENTER 10.6
1895 MEMORIAL HOSPITAL AT EASTON 10.6
1883 PENINSULA REGIONAL MEDICAL CENTER 10.6
1889 JOHNS HOPKINS BAYVIEW MEDICAL CENTER 10.7
1910 ATLANTIC GENERAL HOSPITAL 10.8
1896 MARYLAND GENERAL HOSPITAL 10.8
1904 DOCTORS' COMMUNITY HOSPITAL 11.0
1909 FORT WASHINGTON HOSPITAL 11.0
1880 WASHINGTON ADVENTIST HOSPITAL 11.0
1876 SAINT AGNES HOSPITAL 11.1
1890 CHESTER RIVER HOSPITAL CENTER 11.2
1874 MERCY MEDICAL CENTER INC 11.2
1886 MEDSTAR UNION MEMORIAL HOSPITAL 11.3
1872 HARFORD MEMORIAL HOSPITAL 11.5
1908 SHADY GROVE ADVENTIST HOSPITAL 11.7
1905 SOUTHERN MARYLAND HOSPITAL CENTER 11.7
1885 ANNE ARUNDEL MEDICAL CENTER 12.0
1867 MERITUS MEDICAL CENTER 12.5
1898 NORTHWEST HOSPITAL CENTER 12.6
1911 VA MARYLAND HEALTHCARE SYSTEM - BALTIMORE 12.6
1887 WESTERN MARYLAND REGIONAL MEDICAL CENTER 12.6
1899 BALTIMORE WASHINGTON MEDICAL CENTER 12.7
1868 UNIVERSITY OF MARYLAND MEDICAL CENTER 12.7
1901 EDWARD MCCREADY MEMORIAL HOSPITAL 12.9
1903 UPPER CHESAPEAKE MEDICAL CENTER 12.9
1869 PRINCE GEORGES HOSPITAL CENTER 13.0
1888 MEDSTAR SAINT MARY'S HOSPITAL 13.1
1881 GARRETT COUNTY MEMORIAL HOSPITAL 13.5
1894 CIVISTA MEDICAL CENTER 14.2
1900 GREATER BALTIMORE MEDICAL CENTER 7.4
1907 MEDSTAR GOOD SAMARITAN HOSPITAL 8.4
1893 MEDSTAR HARBOR HOSPITAL 9.2
1879 MEDSTAR FRANKLIN SQUARE MEDICAL CENTER 9.3
1882 MEDSTAR MONTGOMERY MEDICAL CENTER 9.3
1873 SAINT JOSEPH MEDICAL CENTER 9.5
1878 BON SECOURS HOSPITAL 9.6
1870 HOLY CROSS HOSPITAL 9.6
1892 CARROLL HOSPITAL CENTER 9.7
1877 SINAI HOSPITAL OF BALTIMORE 9.7
1871 FREDERICK MEMORIAL HOSPITAL 9.8
1884 SUBURBAN HOSPITAL 9.9
1891 UNION HOSPITAL OF CECIL COUNTY 9.9
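Interestingly, GREATER BALTIMORE MEDICAL CENTER in row 1900 has the lowest mortality rate, 7.4, and should come first. Instead, order placed the rates above 10 first and the rates below 10 after them, so 7.4 ended up far down the list. The cause turned out to be that StateData$Disease.Rate is a factor rather than a numeric vector. A factor's levels are sorted as text by default, so "10.1" comes before "7.4", and order follows the level order. Likewise, StateData$Hospital.Name is a factor, not a character vector. The fix is to convert both: StateData$Disease.Rate from factor to numeric, and StateData$Hospital.Name from factor to character.
Here is a small example.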
> x
foo State Country Site
1 1 TX a 1
2 2 TX a 6
3 3 TX b 1
4 4 NY e 1
5 5 NY e 3
6 6 NY f 1
7 7 CA m 8
8 8 CA n 5
> str(x$State)
Factor w/ 3 levels "CA","NY","TX": 3 3 3 2 2 2 1 1
> str(as.character(x$State))
chr [1:8] "TX" "TX" "TX" "NY" "NY" "NY" "CA" "CA"
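Note that converting a factor to numeric has to go through character first, because as.numeric applied directly to a factor returns the internal level codes rather than the values.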
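The pitfall is easy to reproduce on three values. A minimal sketch, with rates and the other variable names chosen by me:

```r
# Rates stored as a factor, as happens when a text column is read as factor.
rates <- factor(c("7.4", "10.1", "9.3"))

# Levels are sorted as text, so "10.1" comes first.
levels(rates)               # "10.1" "7.4" "9.3"

# as.numeric() on the factor gives the level codes, not the rates.
codes <- as.numeric(rates)
codes                       # 2 1 3

# order() on the factor follows the codes, so 10.1 is ranked first.
wrong_order <- order(rates)
wrong_order                 # 2 1 3

# Going through character first recovers the real numbers.
values <- as.numeric(as.character(rates))
values                      # 7.4 10.1 9.3
right_order <- order(values)
right_order                 # 1 3 2
```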
So before calling order, always check that the column has the type you intend to sort by.
That is why I added these two lines before the sort.
StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])
Finally, the first Hospital.Name in the sorted data is the best hospital we are looking for.
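Another option is to avoid factors at read time, so no conversion is needed afterwards. The sketch below uses a three-row inline table instead of the real file, and it assumes that missing rates are written as the text "Not Available"; that marker and the toy column names are assumptions, so check them against the actual CSV before relying on this.

```r
# Inline toy data standing in for the real CSV file (assumed layout).
csv_lines <- c(
  "Hospital.Name,State,Rate",
  "B HOSPITAL,MD,9.3",
  "A HOSPITAL,MD,Not Available",
  "C HOSPITAL,MD,7.4"
)

# stringsAsFactors = FALSE keeps text columns as character;
# na.strings turns the assumed missing-value marker into NA,
# which lets read.csv parse Rate as numeric.
toy <- read.csv(text = csv_lines,
                stringsAsFactors = FALSE,
                na.strings = "Not Available")

is.numeric(toy$Rate)        # TRUE

# Lowest rate first, names break ties, NA rows go last.
toy_sorted <- toy[order(toy$Rate, toy$Hospital.Name), ]
best_name <- toy_sorted[1, "Hospital.Name"]
best_name                   # "C HOSPITAL"
```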
Ranking hospitals by outcome in a state
Write a function called rankhospital that takes three arguments: the 2-character abbreviated name of a
state (state), an outcome (outcome), and the ranking of a hospital in that state for that outcome (num).
The function reads the outcome-of-care-measures.csv file and returns a character vector with the name
of the hospital that has the ranking specified by the num argument. For example, the call
rankhospital("MD", "heart failure", 5)
would return a character vector containing the name of the hospital with the 5th lowest 30-day death rate
for heart failure.
Here is some sample output from the function.
> source("rankhospital.R")
> rankhospital("TX", "heart failure", 4)
[1] "DETAR HOSPITAL NAVARRO"
> rankhospital("MD", "heart attack", "worst")
[1] "HARFORD MEMORIAL HOSPITAL"
> rankhospital("MN", "heart attack", 5000)
[1] NA
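The second function, rankhospital, asks for more than the first: besides the state and the condition, it takes a rank, num, and must return the name of the hospital ranked num for that condition in that state. If num is larger than the number of hospitals, it returns NA. If hospitals are tied on mortality rate, they are ranked alphabetically, with the earlier name ranked first.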
rankhospital <- function(state, outcome, num = "best") {
    data <- read.csv("outcome-of-care-measures.csv")

    A <- data$State == state
    if (!sum(A)) {
        stop("invalid state")
    }

    disease_list <- c("heart attack", "heart failure", "pneumonia")
    if (!outcome %in% disease_list) {
        stop("invalid outcome")
    }

    StateData <- subset(data, State == state)

    if (outcome == "heart attack") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
    } else if (outcome == "heart failure") {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
    } else {
        StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
    }

    colnames(StateData) <- c("Hospital.Name", "Disease.Rate")
    StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
    StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])
    StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

    ## Specify the exact value for num input.
    N <- sum(!is.na(StateData$Disease.Rate))
    if (num == "best") {
        num <- 1
    } else if (num == "worst") {
        num <- N
    }

    Hospital <- StateData[num, "Hospital.Name"]
    Hospital
}
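Compared with best, the new part is the handling of num, which can be "best" (rank 1), "worst" (last place), or a specific rank. Note that worst does not mean the last row among all hospitals in the state, because quite a few hospitals report no mortality rate, that is, NA. Judging from the sample output in the assignment, hospitals with NA do not take part in the ranking, so worst refers to the last hospital that has data. We therefore need to count how many non-NA values Disease.Rate contains.
I compute this with N <- sum(!is.na(StateData$Disease.Rate)), which is simple and direct.
When num exceeds the number of rows, nothing special is needed, because indexing a row that does not exist returns NA. Note that this covers only num values beyond the row count: a num greater than N but no greater than the row count would select a hospital whose rate is NA.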
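One way to make the handling of num explicit is to compare it against the number of hospitals that actually have data. The sketch below works on an already sorted toy data frame; pick_rank and n_valid are hypothetical names, and the guard on num is an addition that the function above does not have.

```r
# Toy data, already ordered by rate; the last hospital has no data.
ranked <- data.frame(
  Hospital.Name = c("A HOSPITAL", "B HOSPITAL", "C HOSPITAL", "D HOSPITAL"),
  Disease.Rate  = c(7.4, 9.3, 12.0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: resolve num, then guard against ranks with no data.
pick_rank <- function(df, num = "best") {
  n_valid <- sum(!is.na(df$Disease.Rate))   # hospitals that have a rate
  if (identical(num, "best")) {
    num <- 1
  } else if (identical(num, "worst")) {
    num <- n_valid
  }
  if (num < 1 || num > n_valid) {
    return(NA_character_)
  }
  df[num, "Hospital.Name"]
}

pick_rank(ranked, "best")    # "A HOSPITAL"
pick_rank(ranked, "worst")   # "C HOSPITAL"
pick_rank(ranked, 4)         # NA, because rank 4 has no data
pick_rank(ranked, 5000)      # NA

# Plain indexing, for comparison:
ranked[5000, "Hospital.Name"]  # NA
ranked[4, "Hospital.Name"]     # "D HOSPITAL"
```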
Ranking hospitals in all states
Write a function called rankall that takes two arguments: an outcome name (outcome) and a hospital ranking
(num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame
containing the hospital in each state that has the ranking specified in num. For example the function call
rankall("heart attack", "best") would return a data frame containing the names of the hospitals that
are the best in their respective states for 30-day heart attack death rates. The function should return a value
for every state (some may be NA). The first column in the data frame is named hospital, which contains
the hospital name, and the second column is named state, which contains the 2-character abbreviation for
the state name. Hospitals that do not have data on a particular outcome should be excluded from the set of
hospitals when deciding the rankings.
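The third task is the rankall function. It ignores the particular state: given only a condition and a rank, it returns a data frame holding, for every state, the name of the hospital at that rank for that condition.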
rankall <- function(outcome, num = "best") {
    data <- read.csv("outcome-of-care-measures.csv")

    disease_list <- c("heart attack", "heart failure", "pneumonia")

    if (!outcome %in% disease_list) {
        stop("invalid outcome")
    }

    ## Extract the hospital, state and rate columns from the data.
    if (outcome == "heart attack") {
        data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
    } else if (outcome == "heart failure") {
        data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
    } else {
        data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
    }

    ## Assign a common column name to the rate column whatever the disease is.
    colnames(data)[3] <- "Disease.Rate"

    ## Convert the Disease.Rate column from factor to numeric for ordering,
    ## and the Hospital.Name column from factor to character.
    data[, "Disease.Rate"] <- as.numeric(as.character(data[, "Disease.Rate"]))
    data[, "Hospital.Name"] <- as.character(data[, "Hospital.Name"])

    ## Collect the distinct state names and order them alphabetically.
    Statelist <- as.character(unique(data$State))
    Statelist <- Statelist[order(Statelist)]

    Final <- data.frame()

    for (i in seq_along(Statelist)) {
        ## Create the sub-data.frame for the specific state.
        StateData <- subset(data, State == Statelist[i])

        ## Order the data.frame by disease rate, then by hospital name.
        StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

        ## Resolve num for this state in a separate variable. Overwriting num
        ## itself would make "worst" reuse the first state's N for every later state.
        N <- sum(!is.na(StateData$Disease.Rate))
        if (num == "best") {
            rank <- 1
        } else if (num == "worst") {
            rank <- N
        } else {
            rank <- num
        }

        Hospital <- StateData[rank, "Hospital.Name"]
        tmp <- data.frame(Hospital, Statelist[i])  # One row of the final data.frame.
        colnames(tmp) <- c("hospital", "state")
        Final <- rbind(Final, tmp)
    }

    Final
}
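This function has to return a data.frame rather than a single hospital name. So each state's data is taken in turn, processed as in the previous function, and the state and its hospital name are appended to the data.frame. The per-state processing is the same as before; the only difference is that a loop is needed to visit every state. I build a vector of the distinct state names with Statelist <- as.character(unique(data$State)). unique drops repeated elements and keeps one of each, but watch the type: the result may still need converting afterwards.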
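The per-state loop can also be written with split and vapply, which avoids growing a data frame with rbind. The following is a sketch on invented data, not the code I submitted; rank_each_state and the toy data frame are hypothetical, and rows without a rate are dropped before ranking.

```r
toy <- data.frame(
  Hospital.Name = c("B HOSPITAL", "A HOSPITAL", "C HOSPITAL", "D HOSPITAL", "E HOSPITAL"),
  State         = c("TX", "TX", "MD", "MD", "NY"),
  Disease.Rate  = c(9.3, 9.3, 12.0, 7.4, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one hospital name (or NA) per state.
rank_each_state <- function(df, num = "best") {
  by_state <- split(df, df$State)   # named list, names in sorted order
  hospital <- vapply(by_state, function(s) {
    s <- s[!is.na(s$Disease.Rate), ]                  # drop rows without data
    s <- s[order(s$Disease.Rate, s$Hospital.Name), ]  # rate, then name
    rank <- if (identical(num, "best")) {
      1
    } else if (identical(num, "worst")) {
      nrow(s)
    } else {
      num
    }
    if (rank < 1 || rank > nrow(s)) NA_character_ else s$Hospital.Name[rank]
  }, character(1))
  data.frame(hospital = unname(hospital), state = names(by_state),
             stringsAsFactors = FALSE)
}

best_all  <- rank_each_state(toy, "best")
worst_all <- rank_each_state(toy, "worst")
best_all$hospital    # "D HOSPITAL" NA "A HOSPITAL"
worst_all$hospital   # "C HOSPITAL" NA "B HOSPITAL"
```

Because the rank is computed inside the function applied to each state, "worst" is resolved separately for every state.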
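With the three functions written, most of the test cases passed. But the "worst" case sometimes went wrong. While tracking down the details, I found that R sometimes reports an error when evaluating else if (num == "worst") in my code: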
> if(num == “best”){num<-1} else if(num == "worst"){num <- N} else {}
Error: unexpected input in "if(num == ?
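I did not work out the cause at the time and hope to understand it fully later. One likely culprit is visible in the command as pasted above: best is wrapped in curly quotes (“ ”) rather than straight ones, and R's parser rejects those characters with exactly this "unexpected input" message.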
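Whether a piece of code parses can be checked without running it, using parse(text = ...). The sketch below compares a straight-quoted condition with a curly-quoted one; parses is a hypothetical helper, and the curly quotes are written as the Unicode escapes \u201c and \u201d so the example itself stays plain ASCII.

```r
# Same condition, once with straight quotes and once with curly quotes.
straight <- 'if (num == "best") 1'
curly    <- "if (num == \u201cbest\u201d) 1"

# Hypothetical helper: TRUE if R can parse the code, FALSE if parsing fails.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

ok_straight <- parses(straight)
ok_curly    <- parses(curly)
```

If ok_curly comes out FALSE while ok_straight is TRUE, the quotes alone explain the error, independent of what the if statement does.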
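Finally, a complaint: the user experience of Coursera's interface since the redesign is truly awful.
All in all, I went from just giving R a try to finding it genuinely fun. This is also my first post on cnblogs. I hope to keep blogging as the experts here do, and to keep learning R.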
Coursera series - R Programming (Johns Hopkins University) - Programming Assignment 3
Original post: http://www.cnblogs.com/yifeili/p/5437384.html