
Coursera Series: R Programming (Johns Hopkins University), Programming Assignment 3

Date: 2016-04-27 12:36:43

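After a month of on-and-off study, the R Programming course is nearly over. In Week 4 the assignments felt increasingly hard for a beginner like me. Fortunately, after several days of exploration and trial and error, I managed to solve the whole thing.

This week's Assignment 3 works with a file from the US Department of Health and Human Services called "outcome-of-care-measures.csv". It stores mortality figures for several common conditions at more than 4,000 hospitals across the 50 US states, specifically 30-day mortality and readmission rates for heart attacks, heart failure, and pneumonia. The task is to rank hospitals within a state, or across the whole country, by mortality for a given condition, and so identify the best hospital, the worst hospital, and the hospital ranked Nth.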

Task 1

Finding the best hospital in a state

Write a function called best that takes two arguments: the 2-character abbreviated name of a state and an outcome name. The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the best (i.e. lowest) 30-day mortality for the specified outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of "heart attack", "heart failure", or "pneumonia". Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties. If there is a tie for the best hospital for a given outcome, then the hospital names should be sorted in alphabetical order and the first hospital in that set should be chosen (i.e. if hospitals "b", "c", and "f" are tied for best, then hospital "b" should be returned).

The function should use the following template.

best <- function(state, outcome) {
  ## Read outcome data
  ## Check that state and outcome are valid
  ## Return hospital name in that state with lowest 30-day death
  ## rate
}

The function should check the validity of its arguments. If an invalid state value is passed to best, the function should throw an error via the stop function with the exact message "invalid state". If an invalid outcome value is passed to best, the function should throw an error via the stop function with the exact message "invalid outcome".

Here is some sample output from the function.

> source("best.R")
> best("TX", "heart attack")
[1] "CYPRESS FAIRBANKS MEDICAL CENTER"
> best("TX", "heart failure")
[1] "FORT DUNCAN MEDICAL CENTER"
> best("MD", "heart attack")
[1] "JOHNS HOPKINS HOSPITAL, THE"
> best("MD", "pneumonia")
[1] "GREATER BALTIMORE MEDICAL CENTER"
> best("BB", "heart attack")
Error in best("BB", "heart attack") : invalid state
> best("NY", "hert attack")
Error in best("NY", "hert attack") : invalid outcome

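The first function, best, takes a state and a condition and returns the name of the hospital in that state that is best at treating that condition. As the assignment explains, "best" means the lowest 30-day mortality for that condition. If several hospitals are tied for the lowest rate, they are sorted alphabetically by name and the first one is returned.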

best <- function (state, outcome){
  data <- read.csv("outcome-of-care-measures.csv")
  A <- data$State == state
  ## Test whether state occurs in data$State. If not, the sum of A will be 0.
  if (!sum(A)) {
    stop ("invalid state")
  }
  
  disease_list <- c("heart attack", "heart failure", "pneumonia")
  ## Test whether outcome is one of the supported diseases.
  if (!outcome %in% disease_list){
    stop ("invalid outcome")
  }
  ## Create the sub-data.frame for the specific state.
  StateData <- subset(data, State == state)
  ## Extract the hospital and rate columns from the data.
  if (outcome == "heart attack") { 
    StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
  }
  else if (outcome == "heart failure"){
    StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
  }
  else {
    StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
  }
  
  ## Assign a common column name for the rate, whatever the disease is.
  colnames(StateData) <- c("Hospital.Name", "Disease.Rate")
  
  ## Convert the Disease.Rate column from factor to numeric so that it sorts by value.
  StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
  StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])
 
  ## Order the data.frame by disease rate, then by hospital name.
  StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]
  ## Return the name of the first hospital after sorting.
  best <- StateData[1, "Hospital.Name"]
  best
}

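First, the data file is read into data. Next I check whether state and outcome are valid. I use A <- data$State == state to build a logical vector. If state does not occur in the file, A is FALSE, FALSE, ..., FALSE and sum(A) is 0; if it does, A contains one or more TRUE values and the sum is nonzero. I picked up this trick from the course slides, and I expect it to be useful later as well.

I could have done the same for outcome, but thanks to the many bloggers on cnblogs I learned another trick: R has a very simple way to test whether an element a is in a vector or list A, namely a %in% A, which returns a logical value.

An if statement then checks each of the two results and calls stop when the input is invalid.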
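The two checks can be tried in isolation. This is a minimal sketch, assuming a hypothetical helper check_args and a toy states vector standing in for data$State; neither is part of the assignment template. It uses %in% for both checks, which for the state gives the same answer as testing sum(A).

```r
# Hypothetical helper (not from the assignment): validates both arguments.
check_args <- function(state, outcome, known_states) {
  valid_outcomes <- c("heart attack", "heart failure", "pneumonia")
  if (!(state %in% known_states)) stop("invalid state")
  if (!(outcome %in% valid_outcomes)) stop("invalid outcome")
  invisible(TRUE)
}

# Toy vector standing in for data$State (an assumption for illustration).
states <- c("TX", "TX", "MD", "NY")

# conditionMessage() extracts the text that was passed to stop().
state_msg <- tryCatch(check_args("BB", "heart attack", states),
                      error = function(e) conditionMessage(e))
outcome_msg <- tryCatch(check_args("NY", "hert attack", states),
                        error = function(e) conditionMessage(e))
print(state_msg)
print(outcome_msg)
```

Checking the state first means that a call with both arguments wrong reports "invalid state", matching the order of the checks in best.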

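Next, subset extracts the rows for the given state. Then, depending on outcome, I pick the column for that condition and use StateData <- StateData[c("Hospital.Name", "Hospital.30.Day…")] to keep only two columns: the hospital name and the mortality rate.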
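The if / else if chain that picks the column could also be written as a lookup in a named character vector. The following is only a sketch of that alternative, not the approach used in this post; rate_columns, select_outcome, and the toy data frame are hypothetical names introduced for illustration, while the column names are the ones used in the code above.

```r
# Map each outcome name to its column name in the CSV.
rate_columns <- c(
  "heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
  "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
  "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
)

# Hypothetical helper: keep the hospital name and the rate for one outcome.
select_outcome <- function(df, outcome) {
  if (!(outcome %in% names(rate_columns))) stop("invalid outcome")
  out <- df[, c("Hospital.Name", rate_columns[[outcome]])]
  colnames(out) <- c("Hospital.Name", "Disease.Rate")
  out
}

# Toy data with made-up rates, standing in for the real file.
toy <- data.frame(
  Hospital.Name = c("A", "B"),
  Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack = c(14.2, 13.0),
  Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure = c(11.1, 10.5),
  Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia = c(9.9, 12.3)
)

result <- select_outcome(toy, "pneumonia")
print(result)
```

With this layout the list of valid outcomes and the column mapping live in one place, so the validity check and the column selection cannot drift apart.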

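The core of the task is sorting. R offers sort, rank, and order. sort(x) sorts the vector x and returns the sorted values. rank() returns the rank of each element of the vector. order() returns, for each place in the sorted sequence, the position of that element in the original vector.
A short R session gives a feel for order():

> w <- c(97, 93, 85, 85, 32, NA, 99)
> w
[1] 97 93 85 85 32 NA 99
> order(w)
[1] 5 3 4 2 1 7 6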

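NA is placed last: order() defaults to na.last = TRUE, which puts missing values at the end regardless of the other values.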
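To compare the three functions side by side, here is a small sketch on the same vector w. The variable names are my own; the calls are base R.

```r
w <- c(97, 93, 85, 85, 32, NA, 99)

# sort() returns the sorted values and drops NA by default.
sorted <- sort(w)

# rank() returns each element's rank; tied values share the average rank,
# and na.last = "keep" leaves NA unranked.
ranks <- rank(w, na.last = "keep")

# order() returns positions; NA goes last by default (na.last = TRUE).
positions <- order(w)

# NA also stays last when sorting in decreasing order, so it is placed
# last rather than compared as the largest value.
positions_desc <- order(w, decreasing = TRUE)

print(sorted)
print(ranks)
print(positions)
print(positions_desc[length(positions_desc)])
```

Because order() returns positions rather than values, it is the one that can be used to reorder the rows of a data frame.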

Since order() returns the positions of the elements in sorted order, A[order(A$a), ] sorts the data frame A by its column a. order() also accepts several keys, as in A[order(A$a, A$b,…), ]: rows are sorted by a first, and b breaks ties. For example:

> x <- data.frame(foo = 1:8, State = c('TX','TX','TX','NY','NY','NY','CA','CA'), Country = c('a','a','b','e','e','f','m','n'), Site = c(1,6,1,1,3,1,8,5))
> x
  foo State Country Site
1   1    TX       a    1
2   2    TX       a    6
3   3    TX       b    1
4   4    NY       e    1
5   5    NY       e    3
6   6    NY       f    1
7   7    CA       m    8
8   8    CA       n    5
> x[order(x$Site),]
  foo State Country Site
1   1    TX       a    1
3   3    TX       b    1
4   4    NY       e    1
6   6    NY       f    1
5   5    NY       e    3
8   8    CA       n    5
2   2    TX       a    6
7   7    CA       m    8
> x[order(x$Site, x$Country),]
  foo State Country Site
1   1    TX       a    1
3   3    TX       b    1
4   4    NY       e    1
6   6    NY       f    1
5   5    NY       e    3
8   8    CA       n    5
2   2    TX       a    6
7   7    CA       m    8
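In the session above both calls happen to return the same row order, so the effect of the second key is not visible. Here is a small sketch where it matters, using a made-up data frame (hospitals, with hypothetical names and rates) modelled on the "b", "c", "f" tie from the assignment text.

```r
# Three hospitals tied at 9.1, listed out of alphabetical order.
hospitals <- data.frame(
  name = c("f", "b", "c", "a"),
  rate = c(9.1, 9.1, 9.1, 12.0),
  stringsAsFactors = FALSE
)

# One key: tied rows keep their original relative order.
by_rate <- hospitals[order(hospitals$rate), ]

# Two keys: the name breaks ties among equal rates.
by_rate_name <- hospitals[order(hospitals$rate, hospitals$name), ]

print(by_rate$name[1])
print(by_rate_name$name[1])
```

With one key the first row is "f", simply because it came first in the input; with the name as second key it is "b", which is what the tie rule asks for.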

So, to sort by the Disease.Rate column while keeping tied hospitals in alphabetical order, one option is the following call to order:

StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

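But there is a catch: sorting this way directly gives the wrong result. For example, with the state MD and the outcome pneumonia, running the code up to the sort gives the following StateData.

> StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]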
> StateData
                                 Hospital.Name Disease.Rate
1897                 CALVERT MEMORIAL HOSPITAL         10.1
1902            HOWARD COUNTY GENERAL HOSPITAL         10.1
1875               JOHNS HOPKINS HOSPITAL, THE         10.2
1906            LAUREL REGIONAL MEDICAL CENTER         10.6
1895               MEMORIAL HOSPITAL AT EASTON         10.6
1883         PENINSULA REGIONAL MEDICAL CENTER         10.6
1889      JOHNS HOPKINS BAYVIEW MEDICAL CENTER         10.7
1910                 ATLANTIC GENERAL HOSPITAL         10.8
1896                MARYLAND GENERAL  HOSPITAL         10.8
1904              DOCTORS'  COMMUNITY HOSPITAL         11.0
1909                  FORT WASHINGTON HOSPITAL         11.0
1880             WASHINGTON ADVENTIST HOSPITAL         11.0
1876                      SAINT AGNES HOSPITAL         11.1
1890             CHESTER RIVER HOSPITAL CENTER         11.2
1874                  MERCY MEDICAL CENTER INC         11.2
1886           MEDSTAR UNION MEMORIAL HOSPITAL         11.3
1872                 HARFORD MEMORIAL HOSPITAL         11.5
1908            SHADY GROVE ADVENTIST HOSPITAL         11.7
1905         SOUTHERN MARYLAND HOSPITAL CENTER         11.7
1885               ANNE ARUNDEL MEDICAL CENTER         12.0
1867                    MERITUS MEDICAL CENTER         12.5
1898                 NORTHWEST HOSPITAL CENTER         12.6
1911 VA MARYLAND HEALTHCARE SYSTEM - BALTIMORE         12.6
1887  WESTERN MARYLAND REGIONAL MEDICAL CENTER         12.6
1899      BALTIMORE WASHINGTON  MEDICAL CENTER         12.7
1868     UNIVERSITY OF MARYLAND MEDICAL CENTER         12.7
1901         EDWARD MCCREADY MEMORIAL HOSPITAL         12.9
1903           UPPER CHESAPEAKE MEDICAL CENTER         12.9
1869            PRINCE GEORGES HOSPITAL CENTER         13.0
1888             MEDSTAR SAINT MARY'S HOSPITAL         13.1
1881          GARRETT COUNTY MEMORIAL HOSPITAL         13.5
1894                    CIVISTA MEDICAL CENTER         14.2
1900          GREATER BALTIMORE MEDICAL CENTER          7.4
1907           MEDSTAR GOOD SAMARITAN HOSPITAL          8.4
1893                   MEDSTAR HARBOR HOSPITAL          9.2
1879    MEDSTAR FRANKLIN SQUARE MEDICAL CENTER          9.3
1882         MEDSTAR MONTGOMERY MEDICAL CENTER          9.3
1873               SAINT JOSEPH MEDICAL CENTER          9.5
1878                      BON SECOURS HOSPITAL          9.6
1870                       HOLY CROSS HOSPITAL          9.6
1892                   CARROLL HOSPITAL CENTER          9.7
1877               SINAI HOSPITAL OF BALTIMORE          9.7
1871               FREDERICK MEMORIAL HOSPITAL          9.8
1884                         SUBURBAN HOSPITAL          9.9
1891            UNION HOSPITAL OF CECIL COUNTY          9.9

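Interestingly, GREATER BALTIMORE MEDICAL CENTER in row 1900 has the lowest mortality, 7.4, and should come first. Instead, order put the values of 10 and above first and the values below 10 after them, so 7.4 ended up near the bottom. The cause turned out to be that StateData$Disease.Rate is a factor, not a numeric vector. A factor is ordered by its levels, and the levels are sorted as strings, so "10.1" comes before "7.4". Likewise, StateData$Hospital.Name is a factor rather than a character vector. The fix is to convert both: StateData$Disease.Rate from factor to numeric, and StateData$Hospital.Name from factor to character.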

Here is a small example.

> x
  foo State Country Site
1   1    TX       a    1
2   2    TX       a    6
3   3    TX       b    1
4   4    NY       e    1
5   5    NY       e    3
6   6    NY       f    1
7   7    CA       m    8
8   8    CA       n    5
> str(x$State)
 Factor w/ 3 levels "CA","NY","TX": 3 3 3 2 2 2 1 1
> str(as.character(x$State))
 chr [1:8] "TX" "TX" "TX" "NY" "NY" "NY" "CA" "CA"

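Note that converting a factor to numeric requires converting to character first; as.numeric applied directly to a factor returns the internal integer codes instead of the values.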
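The pitfall can be reproduced on a tiny factor. This is a sketch with made-up rates; the "Not Available" entry is an assumption about how a missing value might be encoded as text, which is the kind of entry that makes read.csv treat the column as non-numeric in the first place.

```r
# A rate column that was read as a factor (toy values).
rates <- factor(c("10.1", "7.4", "9.3", "Not Available"))

# Levels are sorted as strings, so "10.1" comes before "7.4".
print(levels(rates))

# Direct conversion returns the level codes, not the rates.
codes <- as.numeric(rates)

# Going through character gives the real numbers; the text entry becomes NA
# (suppressWarnings hides the expected coercion warning).
values <- suppressWarnings(as.numeric(as.character(rates)))

print(codes)
print(values)
print(order(rates))   # orders by level: "10.1" first
print(order(values))  # orders by value: 7.4 first, NA last
```

In R 4.0.0 and later, read.csv no longer creates factors by default, but a column like this is then read as character and still sorts as text, so the conversion to numeric is needed either way.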

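So before calling order, always check that each column has the type you actually intend to sort by.

That is why I added these two lines before the sort.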

StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))

StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])

Finally, returning the first hospital gives the best hospital we are after.

Task 2

Ranking hospitals by outcome in a state

Write a function called rankhospital that takes three arguments: the 2-character abbreviated name of a state (state), an outcome (outcome), and the ranking of a hospital in that state for that outcome (num). The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the ranking specified by the num argument. For example, the call

rankhospital("MD", "heart failure", 5)

would return a character vector containing the name of the hospital with the 5th lowest 30-day death rate for heart failure.


Here is some sample output from the function.

> source("rankhospital.R")
> rankhospital("TX", "heart failure", 4)
[1] "DETAR HOSPITAL NAVARRO"
> rankhospital("MD", "heart attack", "worst")
[1] "HARFORD MEMORIAL HOSPITAL"
> rankhospital("MN", "heart attack", 5000)
[1] NA

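The second function, rankhospital, asks for more than the first: besides the state and the condition, the input also includes a rank, num. The function must return the name of the hospital ranked num for that condition in that state. If the rank exceeds the number of hospitals, it returns NA. If hospitals are tied on the mortality rate, they are ranked alphabetically by name, with the earlier name returned first.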

rankhospital <- function (state, outcome, num = "best"){
  data <- read.csv("outcome-of-care-measures.csv")
  A <- data$State == state
  if (!sum(A)) {
    stop ("invalid state")
  }
  
  disease_list <- c("heart attack", "heart failure", "pneumonia")
  if (!outcome %in% disease_list){
    stop ("invalid outcome")
  }
  
  StateData <- subset(data, State == state)
  
  if (outcome == "heart attack") {
    StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
  }
  else if (outcome == "heart failure"){
    StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
  }
  else {
    StateData <- StateData[c("Hospital.Name", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
  }
  colnames(StateData) <- c("Hospital.Name", "Disease.Rate")
  StateData[, "Disease.Rate"] <- as.numeric(as.character(StateData[, "Disease.Rate"]))
  StateData[, "Hospital.Name"] <- as.character(StateData[, "Hospital.Name"])
  StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]

  ## Specify the exact value for num input.
  N <- sum(!is.na(StateData$Disease.Rate))
  if (num == "best"){
    num <- 1
  }
  else if (num == "worst"){
    num <- N
  }
  else{}

  Hospital <- StateData[num, "Hospital.Name"]
  Hospital
}

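Compared with best, the extra part here is handling num. num can be "best" (rank 1), "worst" (last place), or a specific rank. Note that "worst" is not the last row among all hospitals in the state, because quite a few hospitals report no mortality rate, i.e. NA. Judging from the sample output, hospitals with NA are excluded from the ranking, so "worst" is the last hospital that has data. That means we need to count how many non-NA values Disease.Rate contains.

I compute this with N <- sum(!is.na(StateData$Disease.Rate)), which is simple and direct.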

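A num larger than the number of rows needs no special handling, because indexing a row that does not exist returns NA. One caveat: hospitals without data are still in StateData (sorted to the end), so a num between N + 1 and the number of rows returns a hospital that has no rate rather than NA.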
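One way to cover that in-between case is to drop the NA rows before ranking. The sketch below is an assumption-laden illustration, not the code from this post: pick_rank is a hypothetical helper, and toy is a made-up data frame with the two columns used above.

```r
# Hypothetical helper: df has columns Hospital.Name and Disease.Rate.
pick_rank <- function(df, num = "best") {
  # Rank only hospitals that actually report a rate.
  ranked <- df[!is.na(df$Disease.Rate), ]
  ranked <- ranked[order(ranked$Disease.Rate, ranked$Hospital.Name), ]
  n <- nrow(ranked)
  if (identical(num, "best")) {
    idx <- 1L
  } else if (identical(num, "worst")) {
    idx <- n
  } else {
    idx <- as.integer(num)
  }
  # Any rank outside 1..n has no hospital.
  if (n == 0L || idx < 1L || idx > n) return(NA_character_)
  ranked$Hospital.Name[idx]
}

toy <- data.frame(
  Hospital.Name = c("D", "A", "C", "B"),
  Disease.Rate  = c(12.0, NA, 9.5, 9.5),
  stringsAsFactors = FALSE
)

# Without filtering, row 4 of the sorted frame is the hospital with no data.
naive <- toy[order(toy$Disease.Rate, toy$Hospital.Name), ]

print(pick_rank(toy, "best"))
print(pick_rank(toy, "worst"))
print(pick_rank(toy, 4))
print(naive$Hospital.Name[4])
```

Here only three hospitals have data, so rank 4 gives NA after filtering, while the unfiltered frame would return "A".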

Task 3

Ranking hospitals in all states

Write a function called rankall that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num. For example the function call rankall("heart attack", "best") would return a data frame containing the names of the hospitals that are the best in their respective states for 30-day heart attack death rates. The function should return a value for every state (some may be NA). The first column in the data frame is named hospital, which contains the hospital name, and the second column is named state, which contains the 2-character abbreviation for the state name. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

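The third task is the rankall function. It does not care about a particular state: given a condition and a rank, it must return a data frame holding, for every state, the name of the hospital with that rank for that condition.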

rankall <- function (outcome, num = "best"){
  data <- read.csv("outcome-of-care-measures.csv")

  disease_list <- c("heart attack", "heart failure", "pneumonia")
  
  if (!outcome %in% disease_list){
    stop ("invalid outcome")
  }

  ## Extract the hospital, state and rate columns from the data.
  if (outcome == "heart attack") { 
    data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
  }
  else if (outcome == "heart failure"){
    data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure")]
  }
  else {
    data <- data[c("Hospital.Name", "State", "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia")]
  }
  ## Assign a common column name for the rate, whatever the disease is.
  colnames(data)[3] <- "Disease.Rate"
  
  ## Convert the Disease.Rate column from factor to numeric for ordering, and the hospital column to character.
  data[, "Disease.Rate"] <- as.numeric(as.character(data[, "Disease.Rate"]))
  data[, "Hospital.Name"] <- as.character(data[, "Hospital.Name"])
  
  ## Collect the distinct state names and order them alphabetically.
  Statelist <- as.character(unique(data$State))
  Statelist <- Statelist[order(Statelist)]
  
  Final <- data.frame()
  
  for (i in seq_len(length(Statelist))){
    ## Create the sub-data.frame for the specific state.
    StateData <- subset(data, State == Statelist[i])
    
    ## Order the data.frame by disease rate, then by hospital name.
    StateData <- StateData[order(StateData$Disease.Rate, StateData$Hospital.Name), ]
    
    ## Resolve num to a row index for this state. A separate variable is used
    ## so that "best"/"worst" in num is not overwritten for the following states.
    N <- sum(!is.na(StateData$Disease.Rate))
    if (num == "best"){
      idx <- 1
    }
    else if (num == "worst"){
      idx <- N
    }
    else{
      idx <- num
    }
    
    Hospital <- StateData[idx, "Hospital.Name"]
    tmp <- data.frame(Hospital, Statelist[i])  # Create each row for the final data.frame.
    colnames(tmp) <- c("hospital", "state") 
    Final <- rbind(Final, tmp)
  }
  
  Final
}

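This function has to return a data.frame rather than a single hospital name. So each state's data is processed separately with the same steps as in the previous function, and the state and its hospital name are then combined into the data.frame. The per-state processing is the same as before; the only difference is that a loop is needed to go through every state. I build a vector of the distinct state names with Statelist <- as.character(unique(data$State)). unique drops repeated elements and keeps one copy of each, but watch the type: the result may still need converting.

With the three functions written, most test cases pass. But handling "worst" sometimes went wrong. Digging into the details, I found that R sometimes reports an error on the else if (num == "worst") check in my code: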
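The loop over states can also be expressed with split() and vapply(). This is only a sketch of that alternative under stated assumptions: toy is a made-up data frame with the three columns used above, and rank_in_state is a hypothetical helper that drops NA rates before ranking.

```r
# Toy data standing in for the cleaned-up file.
toy <- data.frame(
  Hospital.Name = c("H1", "H2", "H3", "H4", "H5"),
  State         = c("TX", "NY", "TX", "NY", "CA"),
  Disease.Rate  = c(10.2, 8.8, 9.7, NA, NA),
  stringsAsFactors = FALSE
)

# split() groups the rows by state; the groups come out sorted by state name.
by_state <- split(toy, toy$State)

# Hypothetical helper: name of the hospital at rank num within one state.
rank_in_state <- function(df, num) {
  df <- df[!is.na(df$Disease.Rate), ]
  df <- df[order(df$Disease.Rate, df$Hospital.Name), ]
  n <- nrow(df)
  idx <- if (identical(num, "best")) 1 else if (identical(num, "worst")) n else num
  if (n == 0 || idx > n) NA_character_ else df$Hospital.Name[idx]
}

# num is passed unchanged to every state, so it is never overwritten.
hospital <- vapply(by_state, rank_in_state, character(1), num = "worst")

result <- data.frame(hospital = unname(hospital),
                     state = names(by_state),
                     stringsAsFactors = FALSE)
print(result)
```

In this toy data CA has no reported rate at all, so its row holds NA, which matches the requirement that every state gets a row.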

> if(num == “best”){num<-1} else if(num == "worst"){num <- N} else {}

Error: unexpected input in "if(num == ?

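The error shown comes from the quotes: “best” is typed with curly quotes, which the R parser rejects, while "worst" uses straight quotes. Separately, assigning to num inside the rankall loop would replace "worst" with the first state's count, so every later state would be ranked at that fixed position; keeping the per-state index in its own variable avoids this. I still hope to test this more thoroughly later.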
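Whether a line of code parses can be checked without running it. The sketch below uses base R's parse() on two strings; parses is a hypothetical helper, and the curly quotes are written as \u201c and \u201d escapes so the example itself stays plain ASCII.

```r
# The same statement with curly quotes and with straight quotes.
bad  <- "if (num == \u201cbest\u201d) { num <- 1 }"
good <- "if (num == \"best\") { num <- 1 }"

# Hypothetical helper: TRUE if the text is syntactically valid R.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

print(parses(bad))
print(parses(good))
```

Curly quotes usually sneak in when code is copied from a word processor or a web page, so retyping the quotes in the console is a quick fix.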

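Finally, one complaint: the user experience of Coursera's redesigned interface is truly awful.

All in all, I went from just giving R a try to finding it genuinely fun. This is also my first post on cnblogs. I hope to keep blogging like the experts there do, and to keep learning R.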

 


Original post: http://www.cnblogs.com/yifeili/p/5437384.html
