
Android (Java): Simulating a Zhihu Login and Scraping User Information

Date: 2015-08-10 13:41:37

Tags: android, crawler, cookie, Zhihu, simulated login

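Not long ago I read the article "I used a crawler to 'steal' a million Zhihu users in one day, just to prove that PHP is the best language in the world". That article logs in by pasting a cookie straight into the code. My goal here is not to harvest data; I only want to walk through the basic process of simulating a login in Java. An earlier article of mine, "Android Project in Practice: Building the One-Tap Timetable Import of Super Course Schedule", already falls into the simulated-login category. I have also recently seen many people on Zhihu ask how Super Course Schedule (超级课程表) is implemented; at its core it is simulated login, and once you have worked through this article, fetching data that sits behind a login form should no longer be a worry. This article builds on two earlier ones: "Automatic Cookie Management with OkHttp on Android" for keeping cookies, and "A Complete Guide to the Jsoup Library" for Jsoup. To keep things simple I use plain Java SE rather than Android. If you port the code to Android, probably the only change needed is to move the network requests onto a background thread.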
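The remark about porting to Android can be illustrated with a plain-Java sketch. It assumes a hypothetical helper class, BackgroundRunner, that is not part of the original project, and it uses a single-thread ExecutorService to stand in for whatever threading mechanism an Android app would actually use. The blocking network call would go inside the Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs blocking work (for example a login request) off the calling thread.
class BackgroundRunner {
    private static final ExecutorService WORKER = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "network-worker");
        t.setDaemon(true); // do not keep the JVM alive just for this worker
        return t;
    });

    // Submits the task and returns immediately; the caller collects the result later.
    static <T> Future<T> submit(Callable<T> task) {
        return WORKER.submit(task);
    }

    public static void main(String[] args) throws Exception {
        // The lambda stands in for a blocking call such as client.newCall(request).execute()
        Future<String> result = submit(() -> Thread.currentThread().getName());
        System.out.println("work ran on: " + result.get());
    }
}
```

On Android the result would then be handed back to the UI thread before touching any views; that part is platform-specific and is left out here.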

First open the Zhihu home page in Chrome and click Log in; you will see the login screen.

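In Chrome, press F12 to open the developer tools, switch to the Network tab, and tick Preserve log. Do not skip this step: without it the log is cleared when the page navigates after login, and you will not see the login request.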


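Once everything is ready, type your account and password into the input fields and click Log in. After a successful login, a request named email appears in the Network list.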


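Click the email entry. At the bottom of the request details you will see that the request submitted four parameters, and at the top that the request URL is http://www.zhihu.com/login/email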


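You may be surprised to find that Zhihu transmits the password in plain text. The parameters are straightforward: email is the account, password is the password, and remember_me says whether to stay logged in (just send true). The remaining parameter, _xsrf, is, judging by its name, a cross-site request forgery (XSRF) token; it also happens to get in the way of naive crawlers. We therefore have to scrape its value from the page source before submitting. The value sits in a hidden field of the login form.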
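As a rough illustration of pulling a value out of a hidden form field, here is a minimal sketch that uses only java.util.regex so that it runs without extra libraries. The class name XsrfExtractor and the sample markup are assumptions made for this example, and the pattern assumes that name appears before value inside the same input tag; a real HTML parser such as Jsoup is the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the _xsrf value out of a login page's HTML.
class XsrfExtractor {
    // Assumes the name attribute comes before value inside the same <input> tag.
    private static final Pattern XSRF_INPUT = Pattern.compile(
            "<input[^>]*name=\"_xsrf\"[^>]*value=\"([^\"]*)\"");

    // Returns the token, or null when the page has no matching hidden field.
    static String extract(String html) {
        Matcher m = XSRF_INPUT.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up markup for illustration; the token value is not a real one.
        String html = "<form><input type=\"hidden\" name=\"_xsrf\" value=\"abc123\"/></form>";
        System.out.println(extract(html)); // prints abc123
    }
}
```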


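With everything in place you happily simulate the login in code, only to get back a captcha error. It turns out a captcha must be submitted as well, under the parameter name captcha. The captcha image is served from:

http://www.zhihu.com/captcha.gif?r=<timestamp>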

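That gives us the following request data.

  • Request URL
http://www.zhihu.com/login/email
  • Request parameters
_xsrf: value of the hidden field extracted from the form
captcha: the captcha text
email: the email address
password: the password
remember_me: whether to stay logged in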
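To make the shape of the POST body concrete, the sketch below encodes the five parameters listed above as an application/x-www-form-urlencoded string using only the JDK. The class LoginFormEncoder and the sample values are hypothetical; in the article itself the same job is done by OkHttp's form builder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the form body that the login request carries.
class LoginFormEncoder {
    // Collects the five parameters in a stable order.
    static Map<String, String> loginFields(String xsrf, String captcha, String email, String password) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("_xsrf", xsrf);
        fields.put("captcha", captcha);
        fields.put("email", email);
        fields.put("password", password);
        fields.put("remember_me", "true");
        return fields;
    }

    // Joins key=value pairs with '&', URL-encoding both sides.
    static String encode(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // All values are placeholders, not real credentials.
        System.out.println(encode(loginFields("abc123", "wxyz", "user@example.com", "p@ss word")));
        // prints _xsrf=abc123&captcha=wxyz&email=user%40example.com&password=p%40ss+word&remember_me=true
    }
}
```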

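One question remains: how do we obtain the captcha value? The answer is manual input: save the captcha image locally, read it by eye, type it in, and then log in.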
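The manual step might look roughly like the sketch below: build the captcha URL with the current timestamp, then read whatever the person types. CaptchaInput is a hypothetical class for this example. It reads from a Reader so that the same method works with System.in in a real run and with a StringReader in a quick check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for the manual captcha step.
class CaptchaInput {
    // The r parameter is a timestamp in milliseconds, as described above.
    static String captchaUrl(long timestampMillis) {
        return "http://www.zhihu.com/captcha.gif?r=" + timestampMillis;
    }

    // Reads one line typed by a human and strips surrounding whitespace.
    static String readCaptcha(Reader source) throws IOException {
        String line = new BufferedReader(source).readLine();
        return line == null ? "" : line.trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(captchaUrl(System.currentTimeMillis()));
        // In a real run pass new InputStreamReader(System.in); the StringReader simulates typing.
        System.out.println(readCaptcha(new StringReader(" wxyz \n"))); // prints wxyz
    }
}
```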

Network requests use OkHttp and parsing uses Jsoup; we will also need Gson. Add all three as Maven dependencies:

    <dependencies>
        <dependency>
            <groupId>com.squareup.okhttp</groupId>
            <artifactId>okhttp</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.3.1</version>
        </dependency>
    </dependencies>

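Before writing code we have to decide how to keep the login state, in other words how to persist the cookie. We log in only once and afterwards fetch data directly, so the cookie has to be written to disk. I adapted an Android class from the earlier article so that it works on plain Java: instead of saving to SharedPreferences it now saves to a file, in JSON form, which is why Gson is needed.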

package cn.edu.zafu.zhihu;



import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.reflect.TypeToken;

import java.io.*;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

/**
 * User:lizhangqu(513163535@qq.com)
 * Date:2015-07-18
 * Time: 16:54
 */
public class PersistentCookieStore implements CookieStore {
    private static final Gson gson= new GsonBuilder().setPrettyPrinting().create();
    private static final String LOG_TAG = "PersistentCookieStore";
    private static final String COOKIE_PREFS = "CookiePrefsFile";
    private static final String COOKIE_NAME_PREFIX = "cookie_";

    private final HashMap<String, ConcurrentHashMap<String, HttpCookie>> cookies;
    private  Map<String,String> cookiePrefs=new HashMap<String, String>();

    /**
     * Construct a persistent cookie store.
     *
     */
    public PersistentCookieStore() {
        String cookieJson = readFile("cookie.json");
        Map<String,String> fromJson = gson.fromJson(cookieJson,new TypeToken<Map<String, String>>() {}.getType());  
        if(fromJson!=null){
            System.out.println(fromJson);
            cookiePrefs=fromJson;
        }


        cookies = new HashMap<String, ConcurrentHashMap<String, HttpCookie>>();

        // Load any previously stored cookies into the store

        for(Map.Entry<String, String> entry : cookiePrefs.entrySet()) {
            // Host entries hold a comma-separated list of cookie names; skip the "cookie_" entries themselves
            if (entry.getValue() != null && !entry.getKey().startsWith(COOKIE_NAME_PREFIX)) {
                String[] cookieNames = split(entry.getValue(), ",");
                for (String name : cookieNames) {
                    String encodedCookie = cookiePrefs.get(COOKIE_NAME_PREFIX + name);
                    if (encodedCookie != null) {
                        HttpCookie decodedCookie = decodeCookie(encodedCookie);
                        if (decodedCookie != null) {
                            if(!cookies.containsKey(entry.getKey()))
                                cookies.put(entry.getKey(), new ConcurrentHashMap<String, HttpCookie>());
                            cookies.get(entry.getKey()).put(name, decodedCookie);
                        }
                    }
                }

            }
        }
    }

    public void add(URI uri, HttpCookie cookie) {
        String name = getCookieToken(uri, cookie);

        // Save cookie into local store, or remove if expired
        if (!cookie.hasExpired()) {
            if(!cookies.containsKey(uri.getHost()))
                cookies.put(uri.getHost(), new ConcurrentHashMap<String, HttpCookie>());
            cookies.get(uri.getHost()).put(name, cookie);
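        } else {
            // The map is keyed by host, so check the host rather than the full URI string
            if(cookies.containsKey(uri.getHost()))
                cookies.get(uri.getHost()).remove(name);
        }
        // Only rewrite the name list when this host has a cookie map, avoiding a NullPointerException
        if(cookies.containsKey(uri.getHost()))
            cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));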
        cookiePrefs.put(COOKIE_NAME_PREFIX + name, encodeCookie(new SerializableHttpCookie(cookie)));

        String json=gson.toJson(cookiePrefs);
        saveFile(json.getBytes(), "cookie.json");

    }

    protected String getCookieToken(URI uri, HttpCookie cookie) {
        return cookie.getName() + cookie.getDomain();
    }

    public List<HttpCookie> get(URI uri) {
        ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
        if(cookies.containsKey(uri.getHost()))
            ret.addAll(cookies.get(uri.getHost()).values());
        return ret;
    }

    public boolean removeAll() {
        cookiePrefs.clear();
        cookies.clear();
        return true;
    }


    public boolean remove(URI uri, HttpCookie cookie) {
        String name = getCookieToken(uri, cookie);

        if(cookies.containsKey(uri.getHost()) && cookies.get(uri.getHost()).containsKey(name)) {
            cookies.get(uri.getHost()).remove(name);
            if(cookiePrefs.containsKey(COOKIE_NAME_PREFIX + name)) {
                cookiePrefs.remove(COOKIE_NAME_PREFIX + name);
            }
            cookiePrefs.put(uri.getHost(), join(",", cookies.get(uri.getHost()).keySet()));

            return true;
        } else {
            return false;
        }
    }

    public List<HttpCookie> getCookies() {
        ArrayList<HttpCookie> ret = new ArrayList<HttpCookie>();
        for (String key : cookies.keySet())
            ret.addAll(cookies.get(key).values());

        return ret;
    }

    public List<URI> getURIs() {
        ArrayList<URI> ret = new ArrayList<URI>();
        for (String key : cookies.keySet())
            try {
                ret.add(new URI(key));
            } catch (URISyntaxException e) {
                e.printStackTrace();
            }

        return ret;
    }

    /**
     * Serializes Cookie object into String
     *
     * @param cookie cookie to be encoded, can be null
     * @return cookie encoded as String
     */
    protected String encodeCookie(SerializableHttpCookie cookie) {
        if (cookie == null)
            return null;
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            ObjectOutputStream outputStream = new ObjectOutputStream(os);
            outputStream.writeObject(cookie);
        } catch (IOException e) {
            System.out.println("IOException in encodeCookie"+ e);
            return null;
        }

        return byteArrayToHexString(os.toByteArray());
    }

    /**
     * Returns cookie decoded from cookie string
     *
     * @param cookieString string of cookie as returned from http request
     * @return decoded cookie or null if exception occured
     */
    protected HttpCookie decodeCookie(String cookieString) {
        byte[] bytes = hexStringToByteArray(cookieString);
        ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
        HttpCookie cookie = null;
        try {
            ObjectInputStream objectInputStream = new ObjectInputStream(byteArrayInputStream);
            cookie = ((SerializableHttpCookie) objectInputStream.readObject()).getCookie();
        } catch (IOException e) {
            System.out.println("IOException in decodeCookie"+e);
        } catch (ClassNotFoundException e) {
            System.out.println("ClassNotFoundException in decodeCookie"+e);
        }

        return cookie;
    }

    /**
     * Using some super basic byte array <-> hex conversions so we don't have to rely on any
     * large Base64 libraries. Can be overridden if you like!
     *
     * @param bytes byte array to be converted
     * @return string containing hex values
     */
    protected String byteArrayToHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte element : bytes) {
            int v = element & 0xff;
            if (v < 16) {
                sb.append('0');
            }
            sb.append(Integer.toHexString(v));
        }
        return sb.toString().toUpperCase(Locale.US);
    }

    /**
     * Converts hex values from strings to byte array
     *
     * @param hexString string of hex-encoded values
     * @return decoded byte array
     */
    protected byte[] hexStringToByteArray(String hexString) {
        int len = hexString.length();
        byte[] data = new byte[len / 2];
        for (int i = 0; i < len; i += 2) {
            data[i / 2] = (byte) ((Character.digit(hexString.charAt(i), 16) << 4) + Character.digit(hexString.charAt(i + 1), 16));
        }
        return data;
    }
    public static String join(CharSequence delimiter, Iterable tokens) {
        StringBuilder sb = new StringBuilder();
        boolean firstTime = true;
        for (Object token: tokens) {
            if (firstTime) {
                firstTime = false;
            } else {
                sb.append(delimiter);
            }
            sb.append(token);
        }
        return sb.toString();
    }
    public static String[] split(String text, String expression) {
        if (text.length() == 0) {
            return new String[]{};
        } else {
            return text.split(expression, -1);
        }
    }

    public static void saveFile(byte[] bfile, String fileName) {
        BufferedOutputStream bos = null;
        FileOutputStream fos = null;
        File file = null;
        try {
            file = new File(fileName);
            fos = new FileOutputStream(file);
            bos = new BufferedOutputStream(fos);
            bos.write(bfile);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bos != null) {
                try {
                    bos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
            if (fos != null) {
                try {
                    fos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
    }
    public static String readFile(String fileName) {
        BufferedInputStream bis = null;
        FileInputStream fis = null;
        File file = null;
        try {
            file = new File(fileName);
            fis = new FileInputStream(file);
            bis = new BufferedInputStream(fis);

            int available = bis.available();
            byte[] bytes=new byte[available];
            bis.read(bytes);
            String str=new String(bytes);
            return str;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bis != null) {
                try {
                    bis.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
            if (fis != null) {
                try {
                    fis.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
        return "";
    }
}
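The listing above relies on a SerializableHttpCookie class that comes from the earlier article and is not reproduced here. What follows is a reconstruction for illustration only, not the author's original: java.net.HttpCookie is not Serializable, so the wrapper writes a handful of fields by hand and rebuilds the cookie on read. The choice of fields and the CookieRoundTrip helper are assumptions; to use it with the class above it would need the same package declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.HttpCookie;

// Reconstruction for illustration; the original class is not shown in this article.
class SerializableHttpCookie implements Serializable {
    private static final long serialVersionUID = 1L;

    // HttpCookie itself is not Serializable, so it is rebuilt field by field.
    private transient HttpCookie cookie;

    public SerializableHttpCookie(HttpCookie cookie) {
        this.cookie = cookie;
    }

    public HttpCookie getCookie() {
        return cookie;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeObject(cookie.getName());
        out.writeObject(cookie.getValue());
        out.writeObject(cookie.getDomain());
        out.writeObject(cookie.getPath());
        out.writeLong(cookie.getMaxAge());
        out.writeBoolean(cookie.getSecure());
        out.writeInt(cookie.getVersion());
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        String name = (String) in.readObject();
        String value = (String) in.readObject();
        cookie = new HttpCookie(name, value);
        cookie.setDomain((String) in.readObject());
        cookie.setPath((String) in.readObject());
        cookie.setMaxAge(in.readLong());
        cookie.setSecure(in.readBoolean());
        cookie.setVersion(in.readInt());
    }
}

// Hypothetical helper that serializes a cookie to bytes and reads it back.
class CookieRoundTrip {
    static HttpCookie roundTrip(HttpCookie original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new SerializableHttpCookie(original));
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return ((SerializableHttpCookie) in.readObject()).getCookie();
        }
    }
}
```

One limitation of this sketch: HttpCookie measures max-age from the moment the object is constructed, so the countdown restarts each time a cookie is loaded from disk.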

Next, create an OkHttp client and set its cookie handler to the class we just wrote.

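private static OkHttpClient client = new OkHttpClient();
static {
    // A statement cannot sit directly in the class body, so configure the client in a static initializer
    client.setCookieHandler(new CookieManager(new PersistentCookieStore(), CookiePolicy.ACCEPT_ALL));
}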
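To see what a cookie handler does for the client, the sketch below drives java.net.CookieManager directly, without OkHttp: it stores one Set-Cookie header from a response and then asks which Cookie header would be attached to a later request. CookieReplayDemo and the sample cookie are assumptions for this example, and it uses the JDK's built-in in-memory store (by passing null) where the article plugs in PersistentCookieStore.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical demo of the store-then-replay cycle a cookie handler performs.
class CookieReplayDemo {
    // Stores one Set-Cookie header received from `from`, then returns the Cookie
    // header values the manager would attach to a later request to `to`.
    static List<String> replay(String setCookie, URI from, URI to) throws IOException {
        // null store = the JDK's in-memory store; the article passes PersistentCookieStore instead
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        manager.put(from, Collections.singletonMap("Set-Cookie", Collections.singletonList(setCookie)));
        Map<String, List<String>> headers = manager.get(to, Collections.<String, List<String>>emptyMap());
        List<String> cookies = headers.get("Cookie");
        return cookies == null ? Collections.<String>emptyList() : cookies;
    }

    public static void main(String[] args) throws IOException {
        // The cookie name and value are placeholders.
        System.out.println(replay("z_c0=abc; Path=/",
                URI.create("http://www.zhihu.com/"),
                URI.create("http://www.zhihu.com/people/x")));
    }
}
```

With a persistent store behind the manager, the same replay works across program runs, which is what lets the crawler log in once and reuse the session.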

Now we can fetch _xsrf and the captcha. The captcha is saved in the project root as a file named code.png.

private static String xsrf;
public static void getCode() throws IOException{
        Request request = new Request.Builder()
        .url("http://www.zhihu.com/")
        .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
        .build();

        Response response = client.newCall(request).execute();
        String result = response.body().string();

        Document parse = Jsoup.parse(result);
        System.out.println(parse + "");
        result = parse.select("input[type=hidden]").get(0).attr("value")
                .trim();
        xsrf=result;
        System.out.println("_xsrf:" + result);
        String codeUrl = "http://www.zhihu.com/captcha.gif?r=";
        codeUrl += System.currentTimeMillis();
        System.out.println("codeUrl:" + codeUrl);
        Request getcode = new Request.Builder()
                .url(codeUrl)
                .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
                .build();

        Response code = client.newCall(getcode).execute();

        byte[] bytes = code.body().bytes();
        saveCode(bytes, "code.png");
    }
    public static void saveCode(byte[] bfile, String fileName) {
        BufferedOutputStream bos = null;
        FileOutputStream fos = null;
        File file = null;
        try {
            file = new File(fileName);
            fos = new FileOutputStream(file);
            bos = new BufferedOutputStream(fos);
            bos.write(bfile);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bos != null) {
                try {
                    bos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
            if (fos != null) {
                try {
                    fos.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
            }
        }
    }

Then submit the values we obtained, together with the account and password, to log in.

    public static void login(String randCode,String email,String password) throws IOException{
        RequestBody formBody = new FormEncodingBuilder()
        .add("_xsrf", xsrf)
        .add("captcha", randCode)
        .add("email", email)
        .add("password", password)
        .add("remember_me", "true")
        .build();
        Request login = new Request.Builder()
        .url("http://www.zhihu.com/login/email")
        .post(formBody)
        .addHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
        .build();


        Response execute = client.newCall(login).execute();
        System.out.println(decode(execute.body().string()));

    }
public static String decode(String unicodeStr) {
        if (unicodeStr == null) {
            return null;
        }
        StringBuffer retBuf = new StringBuffer();
        int maxLoop = unicodeStr.length();
        for (int i = 0; i < maxLoop; i++) {
            if (unicodeStr.charAt(i) == '\\') {
                if ((i < maxLoop - 5)
                        && ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr
                        .charAt(i + 1) == 'U')))
                    try {
                        retBuf.append((char) Integer.parseInt(
                                unicodeStr.substring(i + 2, i + 6), 16));
                        i += 5;
                    } catch (NumberFormatException localNumberFormatException) {
                        retBuf.append(unicodeStr.charAt(i));
                    }
                else
                    retBuf.append(unicodeStr.charAt(i));
            } else {
                retBuf.append(unicodeStr.charAt(i));
            }
        }
        return retBuf.toString();
    }

When the decoded response printed by login() reports success, the login has worked.


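After that you can fetch whatever you need. As a simple example, the code below prints the nicknames listed on vczh's (轮子哥) followees page. Note that although the method is named getFollowers, the URL it requests is the followees page, that is, the users he follows. Pagination is left for you to handle.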

public static void getFollowers() throws IOException{
        Request request = new Request.Builder()
        .url("http://www.zhihu.com/people/zord-vczh/followees")
        .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
        .build();
        Response response = client.newCall(request).execute();

        String result=response.body().string();

        Document parse = Jsoup.parse(result);

        Elements select = parse.select("div.zm-profile-card");
        StringBuilder builder=new StringBuilder();
        for (int i=0;i<select.size();i++){
            Element element = select.get(i);
            String name=element.select("h2").text();
            System.out.println(name+"");
            builder.append(name);
            builder.append("\n");
        }
    }

That is the information retrieved. Once you are logged in, you can fetch any information your account is allowed to see.

Finally, the source code, an IntelliJ Maven project:
http://download.csdn.net/detail/sbsujjbcy/8984375

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.


Original article: http://blog.csdn.net/sbsujjbcy/article/details/47396659
