Finding Images Relevant to Web Page Content (2): reddit's Approach

Posted: 2015-04-28 22:27:34


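As the previous post explained, content aggregation sites such as Sina Weibo, Twitter, and Facebook have a hard requirement for web page thumbnails: shared content needs a thumbnail image to be appealing. reddit, the social news site popular with young people, is one such site. Because they have open-sourced their site's code on GitHub, it is easy to see how they do it.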

The algorithm that finds a thumbnail image for a web page can be found here: https://github.com/reddit/reddit/blob/0fbea80d45c4ce35e50ae6f8b42e5e60d79743ca/r2/r2/lib/media.py

The function that implements this is _find_thumbnail_image(self).

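The code relies on Beautiful Soup, a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.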

content_type, content = _fetch_url(self.url)

# if it's an image. it's pretty easy to guess what we should thumbnail.
if content_type and "image" in content_type and content:
	return self.url

if content_type and "html" in content_type and content:
	soup = BeautifulSoup.BeautifulSoup(content)
else:
	return None

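_fetch_url requests the link URL and returns the content type and the content of the linked file.
As the _fetch_url function shows, the type is read from the HTTP response headers, where it is given as a MIME (Multipurpose Internet Mail Extensions) type.
If the URL points to an image, the URL itself is returned. If it points to an HTML (HyperText Markup Language) document, the source is parsed with the BeautifulSoup package. For any other file type, the function returns None.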
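
The dispatch on content type can be sketched in isolation. This is only an illustration: classify_content_type is a hypothetical helper, not part of reddit's code, and it is deliberately stricter than the substring checks ("image" in content_type) used in the excerpt above, because it strips header parameters and compares the MIME type itself.

```python
def classify_content_type(content_type):
    """Map a Content-Type header value to 'image', 'html', or None.

    Hypothetical helper for illustration; reddit's code uses plain
    substring checks instead.
    """
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime.startswith("image/"):
        return "image"
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"
    # Anything else (PDF, JSON, video, ...) yields no thumbnail candidate.
    return None
```

A caller would return the URL itself for "image", parse the body for "html", and give up for None.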

# allow the content author to specify the thumbnail:
# <meta property="og:image" content="http://...">
og_image = (soup.find('meta', property='og:image') or
			soup.find('meta', attrs={'name': 'og:image'}))
if og_image and og_image['content']:
	return og_image['content']

# <link rel="image_src" href="http://...">
thumbnail_spec = soup.find('link', rel='image_src')
if thumbnail_spec and thumbnail_spec['href']:
	return thumbnail_spec['href']

Next, the code checks whether the page author has specified a thumbnail. The mechanism is the Open Graph protocol described in the previous post.

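Either a meta tag or a link tag can specify the page's thumbnail. If the page contains one of them, the job is done: the image URL is returned directly. This is convenient, but it has an obvious weakness: nothing checks whether the declared image is actually suitable. Some sites cut corners and declare the site logo rather than an image related to the page; stackoverflow is a typical example. That said, such cases are fairly rare.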
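
To try the same lookup without installing Beautiful Soup, here is a minimal sketch that uses the standard library's html.parser instead. ThumbnailHintParser and declared_thumbnail are hypothetical names chosen for this example; the precedence (og:image first, then image_src) follows the excerpt above.

```python
from html.parser import HTMLParser


class ThumbnailHintParser(HTMLParser):
    """Collect author-declared thumbnail hints from an HTML document."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and self.og_image is None:
            # Accept both property="og:image" and name="og:image".
            declared = (attrs.get("property"), attrs.get("name"))
            if "og:image" in declared and attrs.get("content"):
                self.og_image = attrs["content"]
        elif tag == "link" and self.image_src is None:
            if attrs.get("rel") == "image_src" and attrs.get("href"):
                self.image_src = attrs["href"]


def declared_thumbnail(html):
    """Return the author-declared thumbnail URL, or None if there is none."""
    parser = ThumbnailHintParser()
    parser.feed(html)
    # og:image takes precedence over image_src.
    return parser.og_image or parser.image_src
```

Like the original, this sketch trusts whatever URL the page declares and does not check that it points to a relevant image.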

# ok, we have no guidance from the author. look for the largest
# image on the page with a few caveats. (see below)
max_area = 0
max_url = None
for image_url in self._extract_image_urls(soup):
	# When isolated from the context of a webpage, protocol-relative
	# URLs are ambiguous, so let's absolutify them now.
	if image_url.startswith('//'):
		image_url = coerce_url_to_protocol(image_url, self.protocol)
	size = _fetch_image_size(image_url, referer=self.url)
	if not size:
		continue

	area = size[0] * size[1]

Next comes a loop: after _extract_image_urls has found all the images on the page, the code iterates over them and looks for the largest one.
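
The candidate-gathering step can be approximated with the standard library. This is a sketch under assumptions, not reddit's _extract_image_urls: ImageSrcParser and extract_image_urls are hypothetical names, and urllib.parse.urljoin stands in for coerce_url_to_protocol by resolving protocol-relative ("//host/path") and relative URLs against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, page_url):
    """Return absolute URLs for all images referenced by the page."""
    parser = ImageSrcParser()
    parser.feed(html)
    # urljoin keeps absolute URLs unchanged and resolves everything else
    # (relative paths, "//cdn.example.com/x.png") against page_url.
    return [urljoin(page_url, src) for src in parser.sources]
```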

More precisely, a few restrictions apply:

# ignore little images
if area < 5000:
    g.log.debug('ignore little %s' % image_url)
    continue

# ignore excessively long/wide images
if max(size) / min(size) > 1.5:
    g.log.debug('ignore dimensions %s' % image_url)
    continue

# penalize images with "sprite" in their name
if 'sprite' in image_url.lower():
    g.log.debug('penalizing sprite %s' % image_url)
    area /= 10

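The image area must be at least 5000 square pixels; the ratio of the longer side to the shorter side must not exceed 1.5; and if the URL contains "sprite", the image is penalized by dividing its area by 10.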
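
These three rules, together with the "keep the largest" bookkeeping of the loop, can be written as a small standalone sketch. The names score_image and pick_largest and the constant names are hypothetical; the thresholds are the ones from the excerpt. Image sizes are passed in directly here, so no network access is involved.

```python
MIN_AREA = 5000          # square pixels
MAX_ASPECT_RATIO = 1.5   # longer side / shorter side
SPRITE_PENALTY = 10


def score_image(url, size):
    """Return the (possibly penalized) area of an image, or None if rejected."""
    width, height = size
    if min(width, height) <= 0:
        return None
    area = width * height
    if area < MIN_AREA:
        return None  # ignore little images
    if max(size) / min(size) > MAX_ASPECT_RATIO:
        return None  # ignore excessively long or wide images
    if "sprite" in url.lower():
        area /= SPRITE_PENALTY  # sprites are rarely good thumbnails
    return area


def pick_largest(candidates):
    """candidates: iterable of (url, (width, height)) pairs."""
    best_url, best_area = None, 0
    for url, size in candidates:
        area = score_image(url, size)
        if area is not None and area > best_area:
            best_url, best_area = url, area
    return best_url
```

The sketch targets Python 3, where / is true division. Under Python 2, dividing two integers floors the result unless from __future__ import division is in effect, which would change how the 1.5 threshold behaves.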

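_fetch_image_size(image_url, referer=self.url) is the harder part: to learn the size of each image, the image has to be downloaded. A useful trick is that the dimensions are part of the image file format and are usually stored near the start of the file, so downloading only part of the image is enough to get them. For the details, read that function.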
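
To make the header trick concrete, here is a minimal sketch that reads the dimensions from the first bytes of a PNG or GIF file. It is not how _fetch_image_size is written, and image_size_from_header is a hypothetical name; other formats such as JPEG store the size deeper in the file and are not handled here.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_size_from_header(data):
    """Return (width, height) from the first bytes of a PNG or GIF, else None."""
    # PNG: 8-byte signature, 4-byte chunk length, b"IHDR", then
    # big-endian 32-bit width and height (bytes 16..23).
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then little-endian 16-bit width and height.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None
```

With this approach, 24 bytes are enough for a PNG and 10 bytes for a GIF, so a client could stop reading the response as soon as the size is known.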

if area > max_area:
	max_area = area
	max_url = image_url

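That is all. In one sentence, reddit's approach is: trust the thumbnail declared by the page; if there is none, pick the largest image, subject to a minimum area and a limit on the aspect ratio.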

Here is the result in practice:

[Image: screenshot from the original post]


Original post: http://www.cnblogs.com/meelo/p/4464027.html
