R语言空间换时间算法、Hash键值对在字符串处理中的应用

时间：2015-06-05 10:15:07 阅读：247 评论：0 收藏：0 [点我收藏+]

最近一直在处理交通数据，有时间、车牌，经过的路口地址，数据量较大，本篇针对各车经过的路口时间先后顺序，生成贵阳交通的可通行有向图，即相连的交通路口间是否是双向通行、单向通行。

一、关于数据的说明

车牌号，路口地址都是字符串
时间是日期时间格式
数据量大概有680万

二、原始算法代码

rm(list=ls(all=TRUE))
gc()
library(RODBC)

channel=odbcConnect("transport-connector-R", uid="transport", pwd="transport")  #连接mysql test 数据库
sqlTables(channel)  # 显示test数据库中的表格




#检索test.transport20140901 中贵阳的车辆信息，含车牌，经过的路口
transections_data<-sqlQuery(channel,"select plate,address from transport20140901 where plate like '贵A%' order by plate,time")
odbcClose(channel)

# 读取文件中排序好的路口地址数据
address_file <-file("/home/wanglinlin/transport/address.txt","r") 
sorted_address <-readLines(address_file)
close(address_file)
#sorted_address[256]

#生成贵阳交通路口连通性有向图初始矩阵
transection_count <- length(sorted_address)
tansport_map <- matrix(0,transection_count,transection_count)
#tansport_map

#根据目标地址名称，在地址表中查找其位置编号
find_address<- function(target,address_table){
  len=length(address_table)
  for(i in 1:len)
    if(target==address_table[i])
      return (i)
  return (0)
}

#根据贵阳本地车辆信息，生成贵阳交通图的双向有向图矩阵
transport_data_count <- 6725490 
counter <- transport_data_count-1
transection_id_one=find_address(transections_data[1,2],sorted_address)
for (i in 1:counter){
  transection_id_two=find_address(transections_data[i+1,2],sorted_address)
  if (transections_data[i,1]==transections_data[i+1,1]){
    tansport_map[transection_id_one,transection_id_two] <- 1
  }
  transection_id_one <- transection_id_two
}
write.table(tansport_map,"/home/wanglinlin/transport/tansport_map_two.txt",row.names = FALSE,col.names = FALSE)

上述代码核心为for循环中的语句，for的循环次数不可能减少了，循环中耗时的操作有两个：

<ul><li><span style="font-family: Arial, Helvetica, sans-serif;">find_address(transections_data[i+1,2],sorted_address)</span></li><li><span style="font-family: Arial, Helvetica, sans-serif;">transections_data[i,1]==transections_data[i+1,1]</span></li></ul>

这两个操作分别是在数组中查找字符串的位置（当前路口地址在地址列表中的位置），比较两个字符串是否相等（两个车牌号是否相同），都是关于字符串的操作，相当耗时。

事实上，find_address已经是优化过的操作了，之前最初是which函数，找到所有匹配的位置，返回第一个位置，每次查找都要遍历整个列表。

上图昨天下午3:00左右开始运行的程序，循环次数截止今天上午9:15仅运行了14412次，要运行完整个程序耗时要几十天时间，是不可接受的。

昨晚一直在改算法，希望不进行字符串操作即可以完成，最好的操作就是把车牌号，路口地址数字化，用数字对比，效率会大大提高。

这样，最好的解决方案就是用散列或者hash键值对操作，找了好半天，终于找到R的hash包可以进行这样的操作。用hash包把路口地址和车牌转换成键值对保存在内存中，把对路口地址位置的查找转换成hash值的查找，把车牌号的对比转换成车牌hash值的查找。

最后代码如下：

rm(list=ls(all=TRUE))
gc()
library(RODBC)
library(hash)

channel=odbcConnect("transport-connector-R", uid="transport", pwd="transport")  #连接mysql test 数据库
sqlTables(channel)  # 显示test数据库中的表格




#检索test.transport20140901 中贵阳的车辆信息，含车牌，经过的路口
transections_data<-sqlQuery(channel,"select plate,address from transport20140901 where plate like '贵A%' order by plate,time")

#找出贵阳所有车牌号，并散列化，形成键值对表
plates<-sqlQuery(channel,"select distinct plate from transport20140901 where plate like '贵A%'")
odbcClose(channel)
plate_list=(as.matrix(plates))[,1]
plate_count=length(plate_list)
plate_hash_pairs=hash(plate_list,1:plate_count)


# 读取文件中排序好的路口地址数据
address_file <-file("/home/wanglinlin/transport/address.txt","r") 
sorted_address <-readLines(address_file)
sorted_address_hash_pairs<-hash(sorted_address,1:269)
close(address_file)
#sorted_address[256]

#生成贵阳交通路口连通性有向图初始矩阵
transection_count <- length(sorted_address)
transport_map <- matrix(0,transection_count,transection_count)
#tansport_map


#根据贵阳本地车辆信息，生成贵阳交通图的双向有向图矩阵
transport_data_count <- 6725490 
counter <- transport_data_count-1
plate_hash_pairs[[as.character(transections_data[1,1])]]
plate_hash_pairs[[as.character(transections_data[2,1])]]
sorted_address_hash_pairs[[as.character(transections_data[1,2])]]
sorted_address_hash_pairs[[as.character(transections_data[2,2])]]
for (i in 1:counter){
  if (plate_hash_pairs[[as.character(transections_data[i,1])]]==plate_hash_pairs[[as.character(transections_data[i+1,1])]]){
    transport_map[sorted_address_hash_pairs[[as.character(transections_data[i,2])]],sorted_address_hash_pairs[[as.character(transections_data[i+1,2])]]] <- 1
  }
}
write.table(transport_map,"/home/wanglinlin/transport/transport_map.txt",row.names = FALSE,col.names = FALSE)

最终结果，今天早上8点半到电脑前，发现已经运行完了，结果数据就不再展示了。

总的算法效率提高了几百倍。

R语言空间换时间算法、Hash键值对在字符串处理中的应用

标签：r hash 交通

原文地址：http://blog.csdn.net/gufe_hfding/article/details/46371819

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行