R5: String Processing / Regular Expressions

Posted: 2015-04-23 02:04:02


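R is mostly used for numerical computing, and string processing comes up less often. The string functions people reach for are few: mainly substr, strsplit, paste and regexpr. In fact R's string handling is very capable, because it can use Perl-style regular expressions directly. This fits R's philosophy: as a language, push vectorized computation as far as it will go; as an environment, integrate the best tools from every field. The grep family of functions in R can handle most string problems you are likely to meet.

grep takes its name from the ed command g/re/p (globally search for a regular expression and print the matching lines). It is a powerful Unix text search tool that searches text with a regular expression and prints the lines that match. The family includes grep, egrep and fgrep: egrep is grep with extended regular expressions, and fgrep searches for fixed strings without using regular expressions at all, which makes it fast. Linux ships the GNU version of grep, and its conventions are widely followed; R's grep function is one example.

At the core of grep are regular expressions (often abbreviated regex). A regular expression is a pattern that describes a whole class of strings. Many text editors and programming languages support them for string manipulation. They were first popularized by Unix tools such as grep and later came into wide use, and Perl in particular took them to an extreme.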

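R takes regular expressions seriously, as the arguments of the grep functions show. Older versions of R had an argument "extended", TRUE by default, which selected extended regular expressions (the egrep flavor). Setting it to FALSE selected basic grep syntax, which R discouraged and warned about, and the argument was later removed (it no longer appears in the usage listing further down). Anything basic grep can do, the extended syntax can do too, and usually more simply. When I started out my patterns often failed under the extended syntax, until I found that it is actually the simpler of the two: much of the time you can just write the characters inside []. The argument "perl" is FALSE by default; setting it to TRUE switches to Perl-compatible regular expressions, which are more powerful, although if you have never studied Perl the default syntax is enough. The argument "fixed" describes something different but related: when it is TRUE the pattern is matched literally, with no regular expression rules, which is much faster. I think of it as the counterpart of fgrep. The R help page states that these arguments select the matching mode (basic, extended, Perl and fixed when this was written; extended, Perl and fixed in current R), so pick the one that means what you need. If the settings conflict, R ignores the later argument and says so in a warning.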

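The grep family consists of grep, grepl, sub, gsub, regexpr and gregexpr. Their arguments are very similar, and R documents them on a single help page, so looking up any one of them brings up the same page. Besides the mode arguments above, the first and most important argument is "pattern", a string holding the regular expression, written according to the rules of the chosen mode. The argument "x" is the vector to search. This is a distinctive feature of R: it searches a vector rather than a file, and if you pass a single string the functions behave like ordinary string functions. For grep an element either matches or it does not, so grep returns the indices of the matching elements (1 when x is a single string that matches). grepl does the same search but returns a logical vector: TRUE for a match, FALSE otherwise. For grep, the argument "value" is FALSE by default, which returns the indices; with value = TRUE it returns the matching elements themselves. "ignore.case" is FALSE by default, meaning case-sensitive; TRUE makes matching case-insensitive. "useBytes" is FALSE by default, meaning matching is done character by character; TRUE matches byte by byte, which makes a big difference for Chinese text. For grep, "invert" is FALSE by default; TRUE returns the elements that do not match. The replacement functions sub and gsub have one more argument, "replacement", giving the replacement text.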
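To make the arguments above concrete, here is a small sketch. The vector x and the patterns are made up for illustration; only base R functions are used.

```r
x <- c("apple", "Banana", "cherry", "a.b")

idx    <- grep("an", x)                       # indices of matching elements
hits   <- grepl("an", x)                      # logical vector, same length as x
vals   <- grep("an", x, value = TRUE)         # the matching elements themselves
nocase <- grep("^b", x, ignore.case = TRUE)   # case-insensitive match
rest   <- grep("an", x, invert = TRUE, value = TRUE)  # elements that do NOT match

# "." is a metacharacter by default; fixed = TRUE treats it as a literal dot
dot_regex <- grep(".", x)                     # every non-empty string matches
dot_fixed <- grep(".", x, fixed = TRUE)       # only the element containing "."

print(list(idx = idx, hits = hits, vals = vals, nocase = nocase,
           rest = rest, dot_regex = dot_regex, dot_fixed = dot_fixed))
```

Here idx is 2, vals is "Banana", dot_regex is 1 2 3 4 and dot_fixed is 4, which shows how much fixed = TRUE changes the meaning of the same pattern.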

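The arguments are similar but the return values differ. grep returns indices, which really only says found or not found; the logical vector from grepl says that more directly. sub is a powerful replacement function, far more flexible than substr: the regular expression can encode very flexible rules, and the function returns the string after replacement. With a well-written pattern it can solve most substring problems. The only difference between sub and gsub is that sub replaces the first match in each string while gsub replaces every match. regexpr and gregexpr seem to be used a lot, because they resemble the instr function of other languages and report where a pattern occurs in a string. I do not find them that useful myself, since the usual reason to locate a character is to process the text around it, and sub can do that directly. regexpr relates to gregexpr the way sub relates to gsub: regexpr reports the first match in each element, and gregexpr reports all of them and therefore returns a list.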
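A short sketch of the first-match versus all-matches distinction described above. The input strings are invented for illustration; regmatches is a base R helper that was not mentioned in the text.

```r
x <- c("banana", "cabbage")

first_only <- sub("a", "_", x)    # replaces the first "a" in each element
every_one  <- gsub("a", "_", x)   # replaces every "a"

pos_first <- regexpr("a", x)      # integer vector: start of the first match
pos_all   <- gregexpr("a", x)     # list: one integer vector per element

first_start <- as.integer(pos_first)      # 2 2
all_starts1 <- as.integer(pos_all[[1]])   # 2 4 6
all_starts2 <- as.integer(pos_all[[2]])   # 2 5
matched     <- regmatches(x, pos_all)     # the matched substrings themselves
```

first_only is "b_nana" "c_bbage" and every_one is "b_n_n_" "c_bb_ge". If you do need the positions rather than the replaced text, regexpr and gregexpr combined with regmatches are the usual route.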

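That covers the grep family. With an example in front of you they are easy to use. My advice is to set only "pattern" and "x" (plus "replacement" for sub and gsub) and leave everything else at its default. Writing patterns in the default extended syntax handles most string processing problems, and a basic understanding of regular expressions is all it takes to get this power out of R. Regular expression syntax itself is covered next.

grep, grepl, sub, gsub, regexpr and gregexpr all match using regular expressions. The default is the extended (egrep-style) syntax, and Perl syntax can be selected instead. Here we use sub to introduce regular expressions, because it returns the actual string after replacement.

All logical arguments are left at their defaults (ignore.case = FALSE, case-sensitive; perl = FALSE, no Perl syntax; fixed = FALSE, not a literal match; useBytes = FALSE, match by character). Of the remaining three, pattern is a string holding the regular expression, replacement is a string holding the replacement text, and x is the character vector to work on. The function searches each element of x for the pattern, replaces the first matching substring with replacement (gsub replaces all of them), and returns a result with the same structure as x. To keep the examples clear, replacement is always "", which simply deletes the matched substring. For example, sub("a","",c("abcd","dcba")) removes the a from both strings and returns [1] "bcd" "dcb". Here "a" is just a character, not really a regular expression; real regular expressions get their flexibility from metacharacters.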

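“$” matches the end of a string: sub("a$","",c("abcd","dcba")) removes the a only from a string that ends in a. Its counterpart “^” matches the start of a string. "." stands for any single character (in Perl mode, any character except a newline), for example sub("a.c","",c("abcd","sdacd")). “*” matches zero or more of the preceding character, for example sub("a*b","",c("aabcd","dcaaaba")). Likewise, “?” matches zero or one of the character just before it, and “+” matches one or more. “.*” matches any sequence of characters, for example sub("a.*e","",c("abcde","edcba")).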
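The anchors and quantifiers above, run on small inputs. The examples for “^”, “?” and “+” are additions for illustration (“^” anchors a match to the start of a string); the rest are the calls from the text.

```r
end_a   <- sub("a$",   "", c("abcd", "dcba"))      # "abcd"  "dcb"
start_a <- sub("^a",   "", c("abcd", "dcba"))      # "bcd"   "dcba"
any_one <- sub("a.c",  "", c("abcd", "sdacd"))     # "d"     "sdacd"
star    <- sub("a*b",  "", c("aabcd", "dcaaaba"))  # "cd"    "dca"
opt     <- sub("ab?",  "", c("abcd", "acd"))       # "cd"    "cd"
plus    <- sub("a+",   "", c("aabcd", "bcd"))      # "bcd"   "bcd"
any_seq <- sub("a.*e", "", c("abcde", "edcba"))    # ""      "edcba"
```

Note that an element with no match is returned unchanged, as "sdacd" and "edcba" show.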

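“|” is a logical or: sub("ab|ba","",c("abcd","dcba")) replaces ab or ba. “^” can also mean complement when written at the start of “[]”: sub("[^ab]","",c("abcd","dcba")) matches a character that is neither a nor b. Since sub replaces only the first match, gsub shows the effect better here.

“[]” matches any one character from a set. With no separators between the characters, the set is searched as given, so in sub("[ab]","",c("abcd","dcba")) the pattern behaves like "a|b". The form “[-]” matches a range: sub("[a-c]","",c("abcde","edcba")) matches a character from a to c, and sub("[1-9]","",c("ab001","001ab")) matches a digit from 1 to 9.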
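The alternation and bracket examples above can be checked directly. This sketch simply runs the calls from the text and adds the gsub variant that the text recommends for the complement case.

```r
alt       <- sub("ab|ba",  "", c("abcd", "dcba"))    # "cd"   "dc"
not_first <- sub("[^ab]",  "", c("abcd", "dcba"))    # "abd"  "cba"
not_all   <- gsub("[^ab]", "", c("abcd", "dcba"))    # "ab"   "ba"
set_ab    <- sub("[ab]",   "", c("abcd", "dcba"))    # "bcd"  "dca"
rng_ac    <- sub("[a-c]",  "", c("abcde", "edcba"))  # "bcde" "edba"
rng_19    <- sub("[1-9]",  "", c("ab001", "001ab"))  # "ab00" "00ab"
```

In "dcba" the first character from the set [ab] is the b, not the a, which is why sub("[ab]", ...) returns "dca": the leftmost match wins, regardless of the order inside the brackets.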

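Those are the basic metacharacters; books and references on regular expressions cover them in detail. One last thing worth mentioning is greedy versus lazy matching. By default a quantifier matches as many characters as possible, which is greedy matching: sub("a.*b","",c("aabab","eabbe")) matches the longest substring that starts with a and ends with b (the whole string for "aabab", and "abb" in "eabbe"). For lazy matching, which takes the shortest match, add “?” after the quantifier: sub("a.*?b","",c("aabab","eabbe")) matches the shortest substring that starts at the first a and ends with b.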
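The two calls from the paragraph above, side by side, using the default regular expression engine:

```r
x <- c("aabab", "eabbe")

greedy <- sub("a.*b",  "", x)   # .* grabs as much as it can
lazy   <- sub("a.*?b", "", x)   # .*? stops at the first b that completes a match

greedy   # ""   "ee"
lazy     # "ab" "ebe"
```

In both cases the match still starts at the leftmost a; laziness only changes where it ends.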

1 # String concatenation:
paste() # paste(..., sep = " ", collapse = NULL)
> paste("a","b","c",sep=">")
[1] "a>b>c"
2 # String splitting:
strsplit() # strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> strsplit("a>b>c",">")
[[1]]
[1] "a" "b" "c"
3 # Counting the characters in a string:
nchar()
> nchar("a>b>c")
[1] 5
4 # Extracting substrings:
substr(x, start, stop)
> paste("a","b","c",sep=">")->mm
> substr(mm, 1, 3)
[1] "a>b"
substring(text, first, last = 1000000L)
> substring(mm, 1, 3)
[1] "a>b"
> substring(mm,1,1:3)
[1] "a"   "a>"  "a>b"
Explanation:
1 means start at the first character
1:3 means take the ranges (1,1), (1,2), (1,3)
The following forms replace part of a string:
substr(x, start, stop) <- value
substring(text, first, last = 1000000L) <- value
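The replacement forms are easy to misread, so here is a minimal sketch of how they behave. The variable names s1, s2 and s3 are made up for illustration.

```r
mm <- paste("a", "b", "c", sep = ">")   # "a>b>c"

s1 <- mm
substr(s1, 1, 1) <- "X"          # overwrite one character in place
s1                               # "X>b>c"

s2 <- mm
substr(s2, 1, 3) <- "X"          # value shorter than the range: only 1 character changes
s2                               # "X>b>c"

s3 <- mm
substring(s3, 3) <- "ZZZZZZ"     # value longer than the space left: it is truncated
s3                               # "a>ZZZ"
```

The string never changes length: these forms overwrite characters, they do not insert or delete. For replacements that change the length, sub and gsub are the tools to use.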
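# Character translation and case conversion:
chartr(old, new, x)
> chartr("a","dd",mm)
[1] "d>b>c"
Note: chartr translates character by character, so new must be at least as long as old; here only the first character of "dd" is used.
toupper(x)
> toupper(mm)
[1] "A>B>C"
tolower(x)
> tolower(mm)
[1] "a>b>c"
casefold(x, upper = TRUE)
> casefold(mm, upper = TRUE)
[1] "A>B>C"
The upper argument is what matters here: with upper = FALSE (the default) casefold lowercases, so mm comes back unchanged.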

(1) Functions for working with connections

(1) Overview:

Functions to create, open and close connections.

(2) Usage:

file(description = "", open = "", blocking = TRUE,
     encoding = getOption("encoding"), raw = FALSE)
@ For file, description is a file path or a complete URL, or "" (the default), or "clipboard" for the clipboard.
url(description, open = "", blocking = TRUE,
    encoding = getOption("encoding"))
@ For url, description is a complete URL including the scheme (such as http://, ftp:// or file://).
gzfile(description, open = "", encoding = getOption("encoding"),
       compression = 6)
@ For gzfile, description is the path to a file compressed by gzip; it can also open uncompressed files and files compressed by bzip2, xz or lzma.
bzfile(description, open = "", encoding = getOption("encoding"),
       compression = 9)
@ For bzfile, description is the path to a file compressed by bzip2.
xzfile(description, open = "", encoding = getOption("encoding"),
       compression = 6)
@ For xzfile, description is the path to a file compressed by xz, or by lzma (read only).
unz(description, filename, open = "", encoding = getOption("encoding"))
@ unz can only read, in binary mode, a single file inside a zip archive; description is the full path to the zip file.
pipe(description, open = "", encoding = getOption("encoding"))
@ Rarely used. description is the command line to be piped to or from; it is run in a shell.
fifo(description, open = "", blocking = FALSE,encoding =getOption("encoding"))
@ For fifo, description is the path to the fifo.
socketConnection(host = "localhost", port, server = FALSE,
                 blocking = FALSE, open = "a+",
                 encoding = getOption("encoding"),
                 timeout = getOption("timeout"))
open(con, ...)
## S3 method for class 'connection'
open(con, open = "r", blocking = TRUE, ...)
close(con, ...)
## S3 method for class 'connection'
close(con, type = "rw", ...)
flush(con)
isOpen(con, rw = "")
isIncomplete(con)

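(3) Common arguments

description: a character string describing the connection
open: a character string describing how to open the connection
encoding: the encoding to use

(4) Return value

file, pipe, fifo, url, gzfile, bzfile, xzfile, unz and socketConnection each return a connection object, which inherits from class "connection".

(5) Modes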

@ Possible values for the argument open are

"r" or "rt"
Open for reading in text mode.

"w" or "wt"
Open for writing in text mode.

"a" or "at"
Open for appending in text mode.

"rb"
Open for reading in binary mode.

"wb"
Open for writing in binary mode.

"ab"
Open for appending in binary mode.

"r+", "r+b"
Open for reading and writing.

"w+", "w+b"
Open for reading and writing, truncating file initially.

"a+", "a+b"
Open for reading and appending
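As a quick illustration of the text modes listed above, the following sketch writes, appends to and then reads a temporary file. The file comes from tempfile(), so nothing in the working directory is touched; the variable names are illustrative.

```r
path <- tempfile(fileext = ".txt")   # throwaway file, name chosen by R

con <- file(path, open = "w")        # "w": write in text mode, truncating
writeLines(c("first", "second"), con)
close(con)

con <- file(path, open = "a")        # "a": append in text mode
writeLines("third", con)
close(con)

con <- file(path, open = "r")        # "r": read in text mode
lines_read <- readLines(con)
close(con)

unlink(path)                         # clean up
lines_read                           # "first" "second" "third"
```

Opening the same path with "w" a second time would have discarded the first two lines, which is the practical difference between "w" and "a".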

(6) Examples

@ Example for file
zz <- file("ex.data", "w") # open an output file connection
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n") # cat: write the objects to the connection
cat("One more line\n", file = zz)
close(zz)
readLines("ex.data")
unlink("ex.data") # deletes the files or directories specified by x

@ Example for gzfile (.gz compressed file)
zz <- gzfile("ex.gz", "w") # compressed file
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
close(zz)
readLines(zz <- gzfile("ex.gz"))
close(zz)
unlink("ex.gz")

@ Example for bzfile (ex.bz2 compressed file)
zz <- bzfile("ex.bz2", "w") # bzip2-ed file
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
close(zz)
print(readLines(zz <- bzfile("ex.bz2")))
close(zz)
unlink("ex.bz2")

## An example of a file open for reading and writing

Tfile <- file("test1", "w+")
c(isOpen(Tfile, "r"), isOpen(Tfile, "w")) # both TRUE
cat("abc\ndef\n", file = Tfile)
readLines(Tfile)
seek(Tfile, 0, rw = "r") # reset to beginning
readLines(Tfile)
cat("ghi\n", file = Tfile)
readLines(Tfile)
close(Tfile)
unlink("test1")

## We can do the same thing with an anonymous file.

Tfile <- file()
cat("abc\ndef\n", file = Tfile)
readLines(Tfile)
close(Tfile)

## examples of use of encodings
# write a file in UTF-8 (x stands for whatever object you want to write; it is not defined here)
cat(x, file = (con <- file("foo", "w", encoding = "UTF-8"))); close(con)
# read a 'Windows Unicode' file (assumes a file named "students" exists)
A <- read.table(con <- file("students", encoding = "UCS-2LE")); close(con)

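(2) Transfer Binary Data To and From Connections

(1) Overview

Read binary data from a connection, or write binary data to a connection.

(2) Usage

readBin(con, what, n = 1L, size = NA_integer_, signed = TRUE, endian = .Platform$endian)
writeBin(object, con, size = NA_integer_, endian = .Platform$endian, useBytes = FALSE)

(3) Arguments:

con: a connection object, a character string naming a file, or a raw vector
what: either an object whose mode gives the mode of the vector to be read, or a character vector of length one describing the mode, one of "numeric", "double", "integer", "int", "logical", "complex", "character", "raw".
n: integer, the maximum number of records to read.
size: integer, the number of bytes per element in the byte stream. The default, NA_integer_, uses the natural size.
signed: logical, only used for integers of sizes 1 and 2, where it determines whether the stored value is read as signed or unsigned (can be ignored for now)
object: an R object to be written to the connection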
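Since con may also be a raw vector, the arguments can be tried without touching the file system. A small sketch with made-up values:

```r
# Write to a raw vector instead of a file: integers take 4 bytes each by default
bytes   <- writeBin(1:3, raw())
n_bytes <- length(bytes)                              # 12

# n is an upper bound on the number of items read
back <- readBin(bytes, what = integer(), n = 3)       # 1 2 3

# size and signed matter when reading 1-byte integers
one_byte <- as.raw(c(1, 2, 255))
unsigned <- readBin(one_byte, "integer", n = 3, size = 1, signed = FALSE)  # 1 2 255
signed   <- readBin(one_byte, "integer", n = 3, size = 1, signed = TRUE)   # 1 2 -1
```

The same byte 0xff reads as 255 or -1 depending on signed, which is the only situation (sizes 1 and 2) where that argument has an effect.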

(6) Examples

zz <- file("testbin", "wb")
writeBin(1:10, zz)
writeBin(pi, zz, endian = "swap")
writeBin(pi, zz, size = 4)
writeBin(pi^2, zz, size = 4, endian = "swap")
writeBin(pi+3i, zz)
writeBin("A test of a connection", zz)
z <- paste("A very long string", 1:100, collapse = " + ")
writeBin(z, zz)
if(.Machine$sizeof.long == 8 || .Machine$sizeof.longlong == 8)
    writeBin(as.integer(5^(1:10)), zz, size = 8)
if((s <- .Machine$sizeof.longdouble) > 8)
    writeBin((pi/3)^(1:10), zz, size = s)
close(zz)

zz <- file("testbin", "rb")
readBin(zz, integer(), 4)   # integer() here is the what argument described above
readBin(zz, integer(), 6)
readBin(zz, numeric(), 1, endian = "swap")
readBin(zz, numeric(), size = 4)
readBin(zz, numeric(), size = 4, endian = "swap")
readBin(zz, complex(), 1)
readBin(zz, character(), 1)
z2 <- readBin(zz, character(), 1)
if(.Machine$sizeof.long == 8 || .Machine$sizeof.longlong == 8)
    readBin(zz, integer(), 10, size = 8)
if((s <- .Machine$sizeof.longdouble) > 8)
    readBin(zz, numeric(), 10, size = s)
close(zz)
unlink("testbin")
stopifnot(z2 == z)

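(3) XML parser

(1) Overview

Parses an XML or HTML file, or a string containing XML/HTML content, and generates an R structure representing the XML/HTML tree.

xmlParse and htmlParse are equivalent to xmlTreeParse and htmlTreeParse respectively, except that they use useInternalNodes = TRUE by default.

(2) Usage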

xmlTreeParse(file, ignoreBlanks=TRUE, handlers=NULL, replaceEntities=FALSE,
             asText=FALSE, trim=TRUE, validate=FALSE, getDTD=TRUE,
             isURL=FALSE, asTree = FALSE, addAttributeNamespaces = FALSE,
             useInternalNodes = FALSE, isSchema = FALSE,
             fullNamespaceInfo = FALSE, encoding = character(),
             useDotNames = length(grep("^\\.", names(handlers))) > 0,
             xinclude = TRUE, addFinalizer = TRUE, error = xmlErrorCumulator(),
             isHTML = FALSE, options = integer(), parentFirst = FALSE)
htmlTreeParse
htmlParse
xmlParse

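(3) Main arguments

file: the name of the file containing the XML content (this can also be a URL or a gzip-compressed file, which the function decompresses directly)
ignoreBlanks: logical, whether text elements made up entirely of white space should be included in the resulting tree.
handlers: an optional collection of functions used to map the different XML nodes to R objects. Typically this is a named list of functions. It provides a way to filter the tree, adding or removing nodes, as it is being created in R.
asText: logical, indicating that the first argument, "file", should be treated as the XML text to parse rather than as a file name. This allows the content to come from different sources (e.g. an HTTP server, XML-RPC, etc.).
trim: logical, whether to strip white space from the beginning and end of text strings
isURL: whether the first argument, "file", refers to a URL (accessed via ftp or http) or a regular file on the system. If asText = TRUE, this argument need not be specified.
useInternalNodes: logical, whether the conversion should produce objects of class XMLInternalNode rather than XMLNode.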
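A minimal sketch of asText = TRUE, assuming the XML package is installed (the code is skipped otherwise). The XML string is invented for illustration; xmlParse, getNodeSet, xmlValue and xmlGetAttr are functions from that package.

```r
if (requireNamespace("XML", quietly = TRUE)) {
  txt <- "<doc><item id='1'>first</item><item id='2'>second</item></doc>"

  # asText = TRUE: the first argument is the XML itself, not a file name
  doc <- XML::xmlParse(txt, asText = TRUE)

  items  <- XML::getNodeSet(doc, "//item")          # all <item> nodes
  values <- sapply(items, XML::xmlValue)            # text content of each node
  ids    <- sapply(items, XML::xmlGetAttr, "id")    # value of the id attribute

  print(values)
  print(ids)
}
```

Because xmlParse returns internal (C-level) nodes, the result can be queried with XPath right away, which is what the getNodeSet section later in this post builds on.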

(4) Official examples

fileName <- system.file("exampleData", "test.xml", package="XML")
#1） parse the document and return it in its standard format.
xmlTreeParse(fileName)
Note the use of system.file here: it looks up a file inside an installed package and returns its full path on the local system.

#2） parse the document, discarding comments.
Note that test.xml contains a comment node.

xmlTreeParse(fileName, handlers=list("comment"=function(x,...){NULL}), asTree = TRUE)

#3) Parse some XML text.
# Read the text from the file
xmlText <- paste(readLines(fileName), "\n", collapse="")
print(xmlText)
xmlTreeParse(xmlText, asText=TRUE)

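(4) readHTMLTable: read data from one or more HTML tables

(1) Overview

This function provides robust methods for extracting data from HTML tables in an HTML document. The document can be given as a file name or URL, or as a document already parsed with htmlParse, in which case all of its tables are read. A single <table> node within a document can also be passed.

(2) Usage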

readHTMLTable(doc, header = NA,
              colClasses = NULL, skip.rows = integer(), trim = TRUE,
              elFun = xmlValue, as.data.frame = TRUE, which = integer(),
               ...)

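(3) Arguments

doc: the HTML document, which can be a file name, a URL, an already parsed HTMLInternalDocument, an HTML node of class XMLInternalElementNode, or a character vector containing the HTML content to parse and process
header: a logical value (whether the table has column labels, e.g. in its first row), or a character vector giving the names for the resulting columns. When reading a whole document, the which argument can be used to select a particular table.
colClasses: either a list or a vector giving the names of the data types for the columns in the table, or a function used to convert the string values to the appropriate type. A NULL entry means that column is dropped from the result. Note that if as.data.frame = TRUE, the type conversion happens before the vectors are turned into a data frame, so to keep character vectors as character rather than having them converted to factors, pass stringsAsFactors = FALSE.
skip.rows: an integer vector giving the rows to skip, i.e. the rows that are not read
trim: logical, whether to remove leading and trailing white space from the cell contents
elFun: (omitted)
as.data.frame: logical, whether to turn the resulting table into a data frame or leave it as a matrix
which: an integer vector identifying which tables to return from the document. This applies to the method for the whole document, not to a single table node.
...: if as.data.frame is TRUE, additional arguments passed on to as.data.frame.

(4) Official examples

Example 1, target URL: http://en.wikipedia.org/wiki/World_population

Walkthrough:

@ The simplest way to get the tables in a document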

u = "http://en.wikipedia.org/wiki/List_of_countries_by_population"

tables = readHTMLTable(u)
names(tables)

tables


tables[[1]]

# Print the table. Note that the values are all characters
# not numbers. Also the column names have a preceding X since
# R doesn't allow the variable names to start with digits.

@ Getting a specific table by fetching the table nodes with getNodeSet.

# Let's just read the first table directly by itself.
doc = htmlParse(u) # the parsed document corresponds to what "view source" shows in a browser
tableNodes = getNodeSet(doc, "//table") # getNodeSet(doc, path): extract the nodes selected by the XPath expression path
tb = readHTMLTable(tableNodes[[1]]) # extract the table data from that node

Example 2, target URL: http://www.nber.org/cycles/cyclesmain.html

Walkthrough:
doc <- "http://www.nber.org/cycles/cyclesmain.html"
# The main table is the second one because it's embedded in the page table.
table <- getNodeSet(htmlParse(doc),"//table") [[2]]
xt <- readHTMLTable(table,
                    header = c("peak","trough","contraction",
                               "expansion","trough2trough","peak2peak"),
                    colClasses = c("character","character","character",
                                   "character","character","character"),
                    trim = TRUE, stringsAsFactors = FALSE
                   )

(5) getNodeSet: find matching nodes in an internal XML tree/DOM

These functions provide a way to find XML nodes that match a particular criterion. It uses the XPath syntax and allows very powerful expressions to identify nodes of interest within a document both clearly and efficiently

(1) Usage

getNodeSet(doc, path, namespaces = xmlNamespaceDefinitions(doc, simplify = TRUE), fun = NULL, sessionEncoding = CE_NATIVE, addFinalizer = NA, ...)
xpathApply(doc, path, fun, ... ,
            namespaces = xmlNamespaceDefinitions(doc, simplify = TRUE),
              resolveNamespaces = TRUE, addFinalizer = NA)
xpathSApply(doc, path, fun = NULL, ... ,
             namespaces = xmlNamespaceDefinitions(doc, simplify = TRUE),
               resolveNamespaces = TRUE, simplify = TRUE, addFinalizer = NA)
matchNamespaces(doc, namespaces,
      nsDefs = xmlNamespaceDefinitions(doc, recursive = TRUE, simplify = FALSE),
      defaultNs = getDefaultNamespace(doc, simplify = TRUE))

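(2) Arguments

doc: an XMLInternalDocument
path: a string (a character vector of length 1) giving the XPath expression to evaluate
namespaces: a named character vector giving the namespace prefix and URI pairs used in the XPath expression and in matching nodes. The prefix is just an alias for the URI, and the URI is the unique identifier of the namespace. The URI is the element of the character vector and the prefix is the corresponding element name. Only the namespaces used in the XPath expression and the nodes of interest need to be given, not every namespace in the document. The prefix does not have to be the same as the one in the document, but the URI must match the target namespace URI in the document, because matching is done on the namespace URIs.
fun: a function object, or an expression or call
...: any additional arguments to be passed to fun for each node in the node set.
simplify: logical, whether the function should try to return the result as a vector rather than a list

(3) Details

The documentation explains this well with an example: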
When a namespace is defined on a node in the XML document, an XPath expressions must use a namespace, even if it is the default namespace for the XML document/node. For example, suppose we have an XML document <help xmlns="http://www.r-project.org/Rd"><topic>...</topic></help> To find all the topic nodes, we might want to use the XPath expression "/help/topic". However, we must use an explicit namespace prefix that is associated with the URI http://www.r-project.org/Rd corresponding to the one in the XML document. So we would use getNodeSet(doc, "/r:help/r:topic", c(r = "http://www.r-project.org/Rd")).
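The documentation's example can be tried on a string, assuming the XML package is installed (the code is skipped otherwise). The two topic values are made up; the prefix r is arbitrary, and only the URI has to match the document.

```r
if (requireNamespace("XML", quietly = TRUE)) {
  txt <- paste0('<help xmlns="http://www.r-project.org/Rd">',
                '<topic>grep</topic><topic>sub</topic></help>')
  doc <- XML::xmlParse(txt, asText = TRUE)

  # Bind the prefix "r" to the document's default namespace URI
  ns     <- c(r = "http://www.r-project.org/Rd")
  topics <- XML::getNodeSet(doc, "/r:help/r:topic", ns)
  topic_values <- sapply(topics, XML::xmlValue)

  print(topic_values)
}
```

Any other prefix name would work equally well, as long as the same name is used in the XPath expression and in the namespaces vector.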

(4) Official examples

@ Ex01

doc = xmlParse(system.file("exampleData", "tagnames.xml", package = "XML"))
els = getNodeSet(doc, "/doc//a[@status]")
sapply(els, function(el) xmlGetAttr(el, "status")) # xmlGetAttr(node, name) gets the value of a named attribute from an XML node

@ EX02

# use of namespaces on an attribute.
getNodeSet(doc, "/doc//b[@x:status]", c(x = "http://www.omegahat.org"))
getNodeSet(doc, "/doc//b[@x:status='foo']", c(x = "http://www.omegahat.org"))

@ EX03
There are many more examples; I will go through them later.

The following functions come up in basic string manipulation.

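(5) chartr: character translation

Overview

Translates characters in character vectors, in particular from upper to lower case or vice versa.

Usage

chartr(old, new, x)
tolower(x) # straightforward
toupper(x) # straightforward
casefold(x, upper = FALSE)

Arguments

x: a character vector, or an object that can be coerced to character by as.character
old: a character string specifying the characters to be translated. If a character vector of length 2 or more is supplied, the first element is used with a warning.
new: a character string specifying the translations. Same as above.
upper: logical, translate to upper case (TRUE) or lower case (FALSE)?

Details

chartr translates each character in x that is specified in old to the corresponding character specified in new.

casefold is a wrapper for tolower and toupper.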
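A short sketch of the character-by-character mapping described above. The input string is arbitrary.

```r
x <- "MiXeD cAsE 123"

mapped <- chartr("iXs", "why", x)      # i -> w, X -> h, s -> y
lower  <- tolower(x)
upper  <- toupper(x)
fold_u <- casefold(x, upper = TRUE)    # same as toupper(x)
fold_l <- casefold(x)                  # upper = FALSE by default, same as tolower(x)

mapped   # "MwheD cAyE 123"
```

Characters that do not appear in old (including the digits and the space) pass through untouched.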

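Return value

All of the functions above return a character vector of the same length as x.

Official examples

1) Basic example (image in the original, not preserved)

2) ## "Mixed Case" Capitalizing - toupper( every first letter of a word )

Capitalize the first letter of every word in a string.

step 1: write a simple function

.simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "", collapse = " ")
}

step 2: test

.simpleCap("the quick red fox jumps over the lazy brown dog")

3) ## and the better, more sophisticated version:

A slightly more complex example.

step 1: write the function (I have not fully worked out how this one works yet)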

capwords <- function(s, strict = FALSE) { # outer function
    cap <- function(s) # inner function
        paste(toupper(substring(s, 1, 1)),
              { s <- substring(s, 2); if(strict) tolower(s) else s },
              sep = "", collapse = " ")
    sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

step 2: test

capwords(c("using AIC for model selection"))
## -> [1] "Using AIC For Model Selection"

4）## -- Very simple insecure crypto --

rot <- function(ch, k = 13) {
   p0 <- function(...) paste(c(...), collapse = "")
   A <- c(letters, LETTERS, " '")
   I <- seq_len(k); chartr(p0(A), p0(c(A[-I], A[I])), ch)
}
pw <- "my secret pass phrase"
(crypw <- rot(pw, 13)) #-> you can send this off
## now ``decrypt'' :
rot(crypw, 54 - 13) # -> the original:
stopifnot(identical(pw, rot(crypw, 54 - 13)))


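(6) nchar

# Count the Number of Characters (or Bytes or Width)

Counts the number of characters, bytes, or display columns.

Overview

nchar takes a character vector x as its argument and returns a vector whose elements give the sizes of the corresponding elements of x.

Usage

nchar(x, type = "chars", allowNA = FALSE)
nzchar(x)

Arguments

x: a character vector, or a vector that can be coerced to a character vector
type: a character string, one of c("bytes", "chars", "width")
allowNA: logical, whether NA should be returned for invalid multibyte strings or "bytes"-encoded strings, rather than throwing an error

Details

The "size" of a character string can be measured in one of three ways: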

bytes
The number of bytes needed to store the string (plus in C a final terminator which is not counted).
chars
The number of human-readable characters.
width
The number of columns cat will use to print the string in a monospaced font. The same as chars if this cannot be calculated.
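The three measures only differ for non-ASCII text, so a sketch with two Chinese characters shows the distinction. The string is built with \u escapes and converted with enc2utf8 so that the byte count below refers to UTF-8 storage; the width value depends on the platform and is not asserted here.

```r
s <- enc2utf8("\u4e2d\u6587xy")        # two Chinese characters followed by "xy"

n_chars <- nchar(s, type = "chars")    # 4 characters
n_bytes <- nchar(s, type = "bytes")    # 8 bytes: 3 + 3 for the Chinese characters, 1 + 1 for "xy"
n_width <- nchar(s, type = "width")    # display columns; wide characters usually count double

empty_check <- nzchar(c("", "a"))      # FALSE TRUE: is each string non-empty?
```

This is the same character-versus-byte distinction that the useBytes argument of the grep functions controls.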

Official examples

x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
nchar(x)
# 5 6 6 1 15
nchar(deparse(mean))
# 18 17


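(7) sets: set operations

Overview

Performs set union, intersection, (asymmetric!) difference, equality and membership on two vectors

That is, union, intersection, difference and related set operations on two vectors.

Usage

union(x, y)
intersect(x, y)
setdiff(x, y)
setequal(x, y)
is.element(el, set)

Arguments

x, y, el, set: vectors (of the same mode) containing a sequence of items, conceptually with no duplicated values

Details

union, intersect, setdiff and setequal discard any duplicated values in their arguments.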
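A small sketch of the duplicate handling and of the asymmetry of setdiff, with made-up vectors:

```r
x <- c(1, 2, 2, 3, 3, 3)
y <- c(3, 4, 4)

u    <- union(x, y)       # 1 2 3 4
both <- intersect(x, y)   # 3
x_only <- setdiff(x, y)   # 1 2  (in x but not in y)
y_only <- setdiff(y, x)   # 4    (in y but not in x)

same_set <- setequal(c(1, 2, 2), c(2, 1))   # TRUE: duplicates and order are ignored
member   <- is.element(2, x)                # TRUE
```

setdiff(x, y) and setdiff(y, x) give different answers, which is what "asymmetric" in the description refers to.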

Official examples

(x <- c(sort(sample(1:20, 9)), NA))
(y <- c(sort(sample(3:23, 7)), NA))
union(x, y)   # union
intersect(x, y) # intersection
setdiff(x, y)    # difference: elements of x that are not in y
setdiff(y, x)    # difference: elements of y that are not in x
setequal(x, y) # test whether the two sets are equal
## True for all possible x & y :
setequal( union(x, y), c(setdiff(x, y), intersect(x, y), setdiff(y, x)))
is.element(x, y) # length 10   # logical vector: TRUE where an element of x is in y, FALSE otherwise
is.element(y, x) # length 8

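(8) Matching functions: match, pmatch, charmatch

@ First, match

Overview

match returns a vector of the positions of (first) matches of its first argument in its second.

# %in% is a more intuitive interface as a binary operator; it returns a logical vector indicating whether each element of its left operand has a match.

Usage

match(x, table, nomatch = NA_integer_, incomparables = NULL)
x %in% table
The difference between match and %in%: match returns a vector of matched positions, with NA where there is no match, while %in% returns TRUE or FALSE and never NA.

Arguments

x: a vector or NULL: the values to be matched (long vectors are supported)
table: a vector or NULL: the values to be matched against (long vectors are not supported)
nomatch: the value to be returned when no match is found (note that it is coerced to integer)
incomparables: a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value. For historical reasons, FALSE is equivalent to NULL.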
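The arguments above are easiest to see on a tiny example. The vectors x and tbl are made up for illustration.

```r
x   <- c("b", "z", "a", NA)
tbl <- c("a", "b", "c", NA)

pos      <- match(x, tbl)                       # 2 NA 1 4  (NA matches NA)
found    <- x %in% tbl                          # TRUE FALSE TRUE TRUE
pos_zero <- match(x, tbl, nomatch = 0L)         # 2 0 1 4
pos_excl <- match(x, tbl, incomparables = "a")  # 2 NA NA 4 ("a" is never matched)
```

nomatch = 0 is handy for subsetting, because an index of 0 selects nothing: tbl[match(x, tbl, nomatch = 0)] keeps only the values that were found.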

Official examples

## The intersection of two sets can be defined via match():
## Simple version:
## intersect <- function(x, y) y[match(x, y, nomatch = 0)]
intersect # the R function in base, slightly more careful
intersect(1:10, 7:20)

1:10 %in% c(1,3,5,9)
sstr <- c("c","ab","B","bba","c",NA,"@","bla","a","Ba","%")
sstr[sstr %in% c(letters, LETTERS)]

"%w/o%" <- function(x, y) x[!x %in% y] #-- x without y
(1:10) %w/o% c(3,7,12)

## Note that setdiff() is very similar and typically makes more sense:
c(1:6,7:2) %w/o% c(3,7,12) # -> keeps duplicates
setdiff(c(1:6,7:2), c(3,7,12)) # -> unique values

@ Next, pmatch

Usage

pmatch(x, table, nomatch = NA_integer_, duplicates.ok = FALSE)
# duplicates.ok: whether elements of table may be used more than once

As you can see, the arguments are much the same as for match.

Official examples

pmatch("", "") # returns NA
pmatch("m", c("mean", "median", "mode")) # returns NA
pmatch("med", c("mean", "median", "mode")) # returns 2

pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = FALSE)
pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = TRUE)

## compare
charmatch(c("", "ab", "ab"), c("abc", "ab"))

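@ Finally, charmatch

Usage

charmatch(x, table, nomatch = NA_integer_)

Details

Exact matches are preferred to partial matches. If there is a single exact match, or no exact match and a unique partial match, the index of the matching value is returned.

If there are multiple exact matches or multiple partial matches, 0 is returned.

If no match is found, the value given in nomatch is returned.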
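The three outcomes can be seen side by side, along with how pmatch treats the ambiguous case. The table tbl is the one used in the official examples; the other inputs are made up.

```r
tbl <- c("mean", "median", "mode")

unique_partial <- charmatch("med", tbl)   # 2: unique partial match
ambiguous      <- charmatch("m", tbl)     # 0: several partial matches
no_match       <- charmatch("x", tbl)     # NA: nomatch is returned
exact_first    <- charmatch("mean", c("mean", "meanwhile"))  # 1: exact beats partial

pmatch_ambiguous <- pmatch("m", tbl)      # NA: pmatch reports ambiguity as nomatch, not 0
```

So a return value of 0 from charmatch means "ambiguous", while NA means "not found"; pmatch does not make that distinction.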

Official examples

charmatch("", "") # returns 1
charmatch("m", c("mean", "median", "mode")) # returns 0
charmatch("med", c("mean", "median", "mode")) # returns 2

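(9) Pattern Matching and Replacement

(1) Overview

grep, grepl, regexpr and gregexpr search each element of a character vector for matches to the argument pattern; they differ only in the format and detail of what they return.

sub and gsub replace the first match and all matches respectively.

(2) Usage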

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE,fixed = FALSE, useBytes = FALSE)

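(3) Arguments

pattern: a character string containing a regular expression
x, text: a character vector in which matches are sought, or an object that can be coerced to a character vector by as.character
ignore.case: if FALSE, pattern matching is case-sensitive; if TRUE, case is ignored during matching
perl: logical, whether Perl-compatible regular expressions should be used
value: if FALSE, grep returns an integer vector of the indices of the matches; if TRUE, it returns the matching elements of x themselves
fixed: logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
useBytes: logical. If TRUE, matching is done byte by byte rather than character by character.
invert: logical. If TRUE, return the indices or values of the elements that do not match.
replacement: a replacement for the matched pattern in sub and gsub. When fixed = FALSE it can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern (written "\\1" to "\\9" inside an R string).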
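Backreferences and the fixed argument are the two parts of this list that most often trip people up, so here is a short sketch. The date strings are made up.

```r
dates <- c("2015-04-23", "2021-07-29")

# \\1, \\2, \\3 refer to the first, second and third parenthesized groups
swapped <- sub("([0-9]+)-([0-9]+)-([0-9]+)", "\\3/\\2/\\1", dates)
swapped                                               # "23/04/2015" "29/07/2021"

# With fixed = TRUE the pattern is literal, so "." is just a dot
literal <- gsub(".", "-", "a.b.c", fixed = TRUE)      # "a-b-c"
regex   <- gsub(".", "-", "a.b.c")                    # "-----"
```

The backslash has to be doubled because the R string parser consumes one level of escaping before the regular expression engine ever sees the replacement.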

(4) Return value (omitted)

(5) Official examples

@ grep examples

grep("[a-z]", letters)
txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo", txt)))
    cat("'foo' appears at least once in\n\t", txt, "\n")
i # 2 and 4
txt[i]

@ gsub example

## Double all 'a' or 'b's; "\" must be escaped, i.e., 'doubled'
gsub("([ab])", "\\1_\\1_", "abc and ABC")

@ sub example

## Note that in locales such as en_US this includes B as the
## collation order is aAbBcCdEe ...

ot <- sub("[b-e]",".", txt)

@ regexpr and gregexpr (the leading g stands for global)

regexpr("en", txt)
gregexpr("e", txt)

@ Some practical examples

##1) trim trailing white space

str <- "Now is the time "
sub(" +$", "", str) ## spaces only
sub("[[:space:]]+$", "", str) ## white space, POSIX-style
sub("\\s+$", "", str, perl = TRUE) ## Perl-style white space

##2) capitalizing

txt <- "a test of capitalizing"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl=TRUE)
gsub("\\b(\\w)", "\\U\\1", txt, perl=TRUE)

txt2 <- "useRs may fly into JFK or laGuardia"
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)
sub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)


## Decompose a URL into its components.

## Example by LT (http://www.cs.uiowa.edu/~luke/R/regexp.html).

x <- "http://stat.umn.edu:80/xyz"
m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
m
regmatches(x, m)
## Element 3 is the protocol, 4 is the host, 6 is the port, and 7
## is the path. We can use this to make a function for extracting the
## parts of a URL:
URL_parts <- function(x) {
    m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
    parts <- do.call(rbind,
                     lapply(regmatches(x, m), `[`, c(3L, 4L, 6L, 7L)))
    colnames(parts) <- c("protocol","host","port","path")
    parts
}
URL_parts(x)


Original article: http://www.cnblogs.com/Bfrican/p/4449121.html
