标签:
在这个大选之年,美国统计协会(ASA)将学生竞赛和总统选举放在一起,将学生预测谁是2016年总统大选的赢家准确的百分比作为比赛点。详情见:
http://thisisstatistics.org/electionprediction2016/
互联网上有很多公开的民调数据。可以下面的网站获取总统大选的相关数据:
http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/
其他较好的数据源是:
http://www.realclearpolitics.com/epolls/latest_polls/
http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton
http://www.gallup.com/products/170987/gallup-analytics.aspx)
值得注意的是:数据是每天更新的,所以你在看本文的时候很可能数据变化而得到不同的结果。
因为原始的数据是JSON文件,R拉取下来将其作为了lists中的一个list(列表)。
原文的Github地址:https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd
##载入需要的包 require(XML) require(dplyr) require(tidyr) require(readr) require(mosaic) require(RCurl) require(ggplot2) require(lubridate) require(RJSONIO) ##数据拉取 url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/" doc <- htmlParse(url, useInternalNodes = TRUE) #爬取网页内容 sc = xpathSApply(doc, "//script[contains(., ‘race.model‘)]", function(x) c(xmlValue(x), xmlAttrs(x)[["href"]])) jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc) data = fromJSON(jsobj) allpolls <- data$polls #unlisting the whole thing indx <- sapply(allpolls, length) pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, ‘length<-‘, max(indx)))) ##数据清洗 #unlisting the weights pollswt <- as.data.frame(t(as.data.frame(do.call(cbind, lapply(pollsdf$weight, data.frame, stringsAsFactors=FALSE))))) names(pollswt) <- c("wtpolls", "wtplus", "wtnow") row.names(pollswt) <- NULL pollsdf <- cbind(pollsdf, pollswt) #unlisting the voting indxv <- sapply(pollsdf$votingAnswers, length) pollsvot <- as.data.frame(do.call(rbind, lapply(pollsdf$votingAnswers, ‘length<-‘, max(indxv)))) pollsvot1 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V1, data.frame, stringsAsFactors=FALSE)))) pollsvot2 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V2, data.frame, stringsAsFactors=FALSE)))) pollsvot1 <- cbind(polltype = rownames(pollsvot1), pollsvot1, polltypeA = gsub(‘[0-9]+‘, ‘‘, rownames(pollsvot1)), polltype1 = extract_numeric(rownames(pollsvot1))) pollsvot1$polltype1 <- ifelse(is.na(pollsvot1$polltype1), 1, pollsvot1$polltype1 + 1) pollsvot2 <- cbind(polltype = rownames(pollsvot2), pollsvot2, polltypeA = gsub(‘[0-9]+‘, ‘‘, rownames(pollsvot2)), polltype1 = extract_numeric(rownames(pollsvot2))) pollsvot2$polltype1 <- ifelse(is.na(pollsvot2$polltype1), 1, pollsvot2$polltype1 + 1) pollsdf <- pollsdf %>% mutate(population = unlist(population), sampleSize = as.numeric(unlist(sampleSize)), pollster = unlist(pollster), startDate = ymd(unlist(startDate)), endDate = ymd(unlist(endDate)), pollsterRating = unlist(pollsterRating)) %>% select(population, sampleSize, pollster, startDate, endDate, pollsterRating, wtpolls, wtplus, wtnow) allpolldata <- cbind(rbind(pollsdf[rep(seq_len(nrow(pollsdf)), each=3),], pollsdf[rep(seq_len(nrow(pollsdf)), each=3),]), rbind(pollsvot1, pollsvot2)) allpolldata <- allpolldata %>% arrange(polltype1, choice)
查看所有的选择数据:allolldata
在找出2016年美国总统竞选的预测选票比例之前,简单的查看数据是非常有必要的。数据集已经整理好了,使用ggplot2包对其进行可视化(选取2016年8月以后的数据,x轴为endDate,y轴为adj_pct,颜色根据choice也就是两种颜色克林顿和希拉里,并根据wtnow设置点的大小):
##快速可视化 ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))), aes(y=adj_pct, x=endDate, color=choice)) + geom_line() + geom_point(aes(size=wtnow)) + labs(title = "Vote percentage by date and poll weight\n", y = "Percent Vote if Election Today", x = "Poll Date", color = "Candidate", size="538 Poll\nWeight")
考虑到每位候选人的选票比例会基于当前投票的票数百分比,所以,必须基于538人(样本容量samplesize)的想法(投票举动)和投票关闭天数(day sine poll)进行选票权重设置。权重的计算公式如下:
使用计算出的权重,我将计算被预测选票百分比的加权平均和其标准偏差(SE)。标准偏差(SE)计算公式来自 Cochran (1977) 。
##快速分析 # 参考文献 # code found at http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation # cited from http://www.cs.tufts.edu/~nr/cs257/archive/donald-gatz/weighted-standard-error.pdf # Donald F. Gatz and Luther Smith, "THE STANDARD ERROR OF A WEIGHTED MEAN CONCENTRATION-I. BOOTSTRAPPING VS OTHER METHODS" weighted.var.se <- function(x, w, na.rm=FALSE) # Computes the variance of a weighted mean following Cochran 1977 definition { if (na.rm) { w <- w[i <- !is.na(x)]; x <- x[i] } n = length(w) xWbar = weighted.mean(x,w,na.rm=na.rm) wbar = mean(w) out = n/((n-1)*sum(w)^2)*(sum((w*x-wbar*xWbar)^2)-2*xWbar*sum((w-wbar)*(w*x-wbar*xWbar))+xWbar^2*sum((w-wbar)^2)) return(out) } # 计算累计平均和加权平均值Cumulative Mean / Weighted Mean allpolldata2 <- allpolldata %>% filter(wtnow > 0) %>% filter(polltypeA == "now") %>% mutate(dayssince = as.numeric(today() - endDate)) %>% mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>% mutate(votewt = wt*pct) %>% group_by(choice) %>% arrange(choice, -dayssince) %>% mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>% mutate(cum.mean = cummean(pct)) View(allpolldata2 )
##绘制累计平均/加权平均Cumulative Mean / Weighted Mean # 累计平均 ggplot(subset(allpolldata2, ( endDate > ymd("2016-01-01"))), aes(y=cum.mean, x=endDate, color=choice)) + geom_line() + geom_point(aes(size=wt)) + labs(title = "Cumulative Mean Vote Percentage\n", y = "Cumulative Percent Vote if Election Today", x = "Poll Date", color = "Candidate", size="Calculated Weight") # 加权平均 ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))), aes(y=cum.mean.wt, x=endDate, color=choice)) + geom_line() + geom_point(aes(size=wt)) + labs(title = "Cumulative Weighted Mean Vote Percentage\n", y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date", color = "Candidate", size="Calculated Weight")
此外,加权平均和平均的标准偏差(科克伦(1977))可以对每个候选人进行计算。使用这个公式,我们可以预测主要候选人的最后的百分比!
pollsummary <- allpolldata2 %>% select(choice, pct, wt, votewt, sampleSize, dayssince) %>% group_by(choice) %>% summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE), std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE))) pollsummary ## # A tibble: 2 x 3 ## choice mean.vote std.vote ## <chr> <dbl> <dbl> ## 1 Clinton 43.48713 0.5073771 ## 2 Trump 38.95760 1.0717574
显然,主要的候选人是克林顿和希拉里,克林顿的选票平均百分比高于希拉里,并且其标准偏差小于希拉里,也就是说其选票变化稳定,最后胜出的很可能就是克林顿,但是按照希拉里的变化波动大,也不排除希拉里获胜的可能。可以看到希拉里的选票比例最高曾达到51%。
原文链接:https://www.r-statistics.com/2016/08/presidential-election-predictions-2016/
本文链接:http://www.cnblogs.com/homewch/p/5811945.html
标签:
原文地址:http://www.cnblogs.com/homewch/p/5811945.html