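For example, while crawling Student Online (学生在线) I found that certain notices were never fetched, such as the "COFCO Fulinmen Student Aid Fund Application Announcement" (《中粮福临门助学基金申请公告》). On investigation it turned out that the links to these notices were being filtered out. Below I go through regex-urlfilter.txt, the configuration file that filters URLs, so that it can be adapted later to suit your own needs.

Notes: in this configuration file, lines beginning with "#" are comments. A line beginning with "-" means that a URL matching the regular expression is filtered out, and a line beginning with "+" means that a URL matching the regular expression is kept. Within the regular expressions, "^" matches the start of the string, "$" matches the end of the string, and "[]" denotes a set of characters. Comments that are not part of the stock file are annotations I added.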
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
# Note: filters out file:, ftp: and other links that are not HTTP
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# Note: filters out links to images and similar file formats
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# Note: the stock rule is -[?*!@=], which filters out links containing
# these special characters. To crawl more links, I changed the rule so
# that links containing ? and = are no longer filtered out.
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# Note: filters out links whose path repeats the same segment over and over
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# Note: accepts every remaining link. You can change this rule so that
# only the kinds of links you specify are accepted.
+.
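Explanation: the link to the notice is http://www.online.sdu.edu.cn/news/article.php?pid=636514943. It contains the characters ? and =, so it was dropped by the rule that filters out URLs containing special characters. After modifying regex-urlfilter.txt as shown above, links to notices of this kind are crawled successfully.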
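To make the effect of the change concrete, here is a minimal Python sketch of the "first matching pattern wins" behaviour described in the file's header comment. It is only an approximation for illustration, not Nutch's own code: the helper names `load_rules` and `accepts` are hypothetical, the rule lists are an abbreviated subset of the file, and Python's `re.search` is assumed to behave like the Java regex matching for these particular patterns.

```python
import re

def load_rules(text):
    """Parse regex-urlfilter.txt-style text into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is ignored."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

# Abbreviated rule set with the stock special-character rule.
STOCK = r"""
-^(file|ftp|mailto):
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
"""
# Same rules with the modified special-character rule from this post.
MODIFIED = STOCK.replace("-[?*!@=]", "-[*!@]")

url = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"
print(accepts(load_rules(STOCK), url))     # False: '?' and '=' hit -[?*!@=]
print(accepts(load_rules(MODIFIED), url))  # True: falls through to '+.'
```

Because the first match decides, rule order matters: a narrower `+` rule only takes effect if it appears before any `-` rule that would also match the same URL.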