Scraping the Baidu homepage logo with Java

Date: 2014-09-24 10:59:26


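Two methods
- One, getUrlContentString, fetches the page source for a URL; the other, getLogoUrl, extracts the desired address fragment from that source by matching it with a regular expression.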
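Getting the page source:
- The address starts out as a String; convert it to a java.net.URL object.
- The URL's openConnection method returns a URLConnection.
- The URLConnection's connect method establishes the connection.
- Create an InputStreamReader. Its constructor needs an InputStream, and URLConnection's getInputStream method returns one, so the two can be chained.
- Wrap the InputStreamReader in a BufferedReader.
- Read the page source from the BufferedReader line by line and append each line to result; when the loop ends, result holds the page source as a string.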
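The reader chain in the last three steps can be tried without a network connection. The following is only a minimal sketch: it assumes an in-memory ByteArrayInputStream stands in for the stream that URLConnection's getInputStream would return, and the names ReaderChainDemo and readAll are hypothetical, chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original program.
final class ReaderChainDemo {

    // Wraps a byte stream in an InputStreamReader, then a BufferedReader,
    // and joins the lines. readLine() drops the line terminators.
    static String readAll(InputStream stream) throws IOException {
        StringBuilder result = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this in-memory stream stands in for urlConnection.getInputStream().
        byte[] page = "<html>\n<body>hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
        // prints <html><body>hello</body></html>
    }
}
```

Because the line breaks are dropped, the whole page ends up on a single line, which matters later: by default `.` in a regular expression does not match a line terminator.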
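Matching the logo address
- Pattern pattern = Pattern.compile(patternString);
  - java.util.regex: the Java library package for matching strings against patterns defined by regular expressions.
    Its two main classes are Pattern and Matcher.
    Pattern: a compiled representation of a regular expression.
    Matcher: matches a Pattern against an input string.
  - Pattern's compile method: compiles the given regular expression string into a Pattern.
- Matcher matcher = pattern.matcher(contentString);
  - Creates a Matcher that will match contentString against the pattern.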
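As a rough illustration of how Pattern, Matcher, and a capture group fit together, here is a small offline sketch. The HTML fragment is made up and is not Baidu's actual markup; RegexGroupDemo and firstGroup are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original program.
final class RegexGroupDemo {

    // Returns the text captured by group 1 of the first match, or null if nothing matches.
    static String firstGroup(String content, String patternString) {
        Matcher matcher = Pattern.compile(patternString).matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumption: a made-up fragment used only to show the matching behavior.
        String html = "<img src=\"/img/logo.png\" alt=\"logo\">";

        // Lazy quantifier: stops at the first closing quote.
        System.out.println(firstGroup(html, "src=\"(.+?)\""));
        // prints /img/logo.png

        // Greedy quantifier: runs on to the last quote in the input.
        System.out.println(firstGroup(html, "src=\"(.+)\""));
        // prints /img/logo.png" alt="logo
    }
}
```

The lazy form `(.+?)` is what keeps the capture confined to a single attribute value.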

package com.test;

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class baidulogo {

    // Downloads the page at urlString and returns its source as a single string.
    static String getUrlContentString(String urlString) throws Exception {
        StringBuilder result = new StringBuilder();
        URL url = new URL(urlString);
        URLConnection urlConnection = url.openConnection();
        urlConnection.connect();
        // try-with-resources closes the reader and the underlying stream when done
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConnection.getInputStream(), "utf-8"))) {
            String line;
            // readLine() strips line terminators, so the lines are joined without breaks
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }

    // Returns capture group 1 of the first match of patternString, or null if there is no match.
    static String getLogoUrl(String contentString, String patternString) {
        String logoUrl = null;
        Pattern pattern = Pattern.compile(patternString);
        Matcher matcher = pattern.matcher(contentString);
        if (matcher.find()) {
            logoUrl = matcher.group(1);
        }
        return logoUrl;
    }

    public static void main(String[] args) throws Exception {
        // The URL to visit
        String urlString = "http://www.baidu.com";
        String result = getUrlContentString(urlString);
        // Lazily captures the value of the first src="..." attribute in the page
        String patternString = "src=\"(.+?)\"";
        String logoUrl = getLogoUrl(result, patternString);
        System.out.println(logoUrl);
    }
}
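One caveat: the pattern `src=\"(.+?)\"` captures the first src attribute anywhere in the page, which could belong to a script or iframe rather than the logo image. A possible refinement, sketched here on a made-up fragment, is to accept src only inside an img tag. The class ImgSrcDemo, the method firstImgSrc, and the sample HTML are all hypothetical, and the pattern assumes lowercase tags with double-quoted attributes; regular expressions remain a fragile way to read HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original program.
final class ImgSrcDemo {

    // Matches src="..." only when it appears as an attribute inside an <img ...> tag.
    static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\ssrc=\"([^\"]+)\"");

    // Returns the src value of the first img tag, or null if there is none.
    static String firstImgSrc(String content) {
        Matcher matcher = IMG_SRC.matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumption: a made-up fragment in which a script tag precedes the image.
        String html = "<script src=\"app.js\"></script>"
                + "<img id=\"logo\" src=\"/img/logo.png\">";
        System.out.println(firstImgSrc(html));
        // prints /img/logo.png
    }
}
```

To try this in the program above, the same pattern string could be passed as patternString to getLogoUrl, since it also exposes the address as group 1.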


Original article: http://www.cnblogs.com/keedor/p/3989762.html
