Commit 416488e

add pictureSpider.py
1 parent 8073a22 commit 416488e

2 files changed

+46
-0
lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -19,3 +19,5 @@
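 ##### 9. [ECUT_get_grade.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/ECUT_get_grade.py): Simulates logging in to the school's official website, scrapes the grades, and computes the credit-weighted average grade.
 
 ##### 10. [githubHot.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/githubHot.py): Scrapes the projects listed under popular languages on GitHub and saves each project's description and homepage URL to a local file.
+
+##### 11. [pictureSpider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/pictureSpider): At the request of a Zhihu user, scrapes all the photo-shoot images from a certain website.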

pictureSpider.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+import re
+import os
+import requests
+
+
+def Spidermain(page=11):
+    '''
+    This crawler uses a depth-first (DFS) crawl strategy.
+    '''
+    main_url_ = 'http://www.rosiok.com/app/list_12_{0}.html'
+    for page_no in range(1, page+1):
+        main_url = main_url_.format(page_no)
+        domain_url = 'http://www.rosiok.com{0}'
+        start_html = requests.get(main_url).content.decode('gb2312')
+        kids_url_regex = re.compile('<strong><a href=\'(.*?)\'>')
+        kids_url = [domain_url.format(i) for i in re.findall(kids_url_regex, start_html)]
+        for kid_url in kids_url:
+            all_pic_urls = []
+            pic_html = requests.get(kid_url).content.decode('gb2312')
+            # Extract the page title
+            title_regex = re.compile('<title>(.*?)</title>')
+            title = re.findall(title_regex, pic_html)[0]
+            # Extract the cover image URL
+            parent_pic_regex = re.compile('<img src="(.*?)" width="796" height="531" alt')
+            parent_pic = re.findall(parent_pic_regex, pic_html)
+            # Extract the URLs of the child images that belong to the cover
+            kids_pic_regex = re.compile('class="a" src="(.*?)" />')
+            kids_pic_url = re.findall(kids_pic_regex, pic_html)
+            # Merge the cover URL list and the child image URL list
+            all_pic_urls.extend(parent_pic)
+            all_pic_urls.extend(kids_pic_url)
+            # Download and save the images
+            if not os.path.exists('./{0}'.format(title)):
+                os.mkdir('./{0}'.format(title))
+            s = requests.Session()
+            for count, pic_url in enumerate(all_pic_urls):
+                with open('./{0}/{1}.jpg'.format(title, count), 'wb') as file:
+                    try:
+                        file.write(s.get(pic_url, timeout=5).content)
+                    except requests.RequestException:
+                        pass
+
+
+if __name__ == '__main__':
+    Spidermain()
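
The parsing steps in `Spidermain` can be tried without touching the network. The sketch below is not part of this commit. It copies the four patterns from the diff into pure functions and runs them on invented markup. The names `parse_album_links`, `parse_album`, `safe_dirname`, `SAMPLE_LIST`, and `SAMPLE_ALBUM` are hypothetical, and the sample HTML and `img.example` URLs are assumptions made up to exercise the patterns, not a capture of the real site.

```python
import re

# Patterns copied from pictureSpider.py, rewritten as raw strings.
KIDS_URL_REGEX = re.compile(r"<strong><a href='(.*?)'>")
TITLE_REGEX = re.compile(r"<title>(.*?)</title>")
PARENT_PIC_REGEX = re.compile(r'<img src="(.*?)" width="796" height="531" alt')
KIDS_PIC_REGEX = re.compile(r'class="a" src="(.*?)" />')


def parse_album_links(list_html, domain="http://www.rosiok.com"):
    """Return absolute album URLs found on one list page."""
    return [domain + path for path in KIDS_URL_REGEX.findall(list_html)]


def parse_album(pic_html):
    """Return (title, image_urls); title is None when no <title> is found."""
    match = TITLE_REGEX.search(pic_html)
    title = match.group(1) if match else None
    # Cover image first, then the child images, as in the original script.
    urls = PARENT_PIC_REGEX.findall(pic_html) + KIDS_PIC_REGEX.findall(pic_html)
    return title, urls


def safe_dirname(title):
    """Hypothetical helper: replace characters that are unsafe in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"


# Invented markup, used only to exercise the patterns above.
SAMPLE_LIST = (
    "<strong><a href='/app/1.html'>A</a></strong>"
    "<strong><a href='/app/2.html'>B</a></strong>"
)
SAMPLE_ALBUM = (
    '<title>Set 01: demo</title>'
    '<img src="http://img.example/cover.jpg" width="796" height="531" alt="c">'
    '<img class="a" src="http://img.example/1.jpg" />'
    '<img class="a" src="http://img.example/2.jpg" />'
)

if __name__ == "__main__":
    print(parse_album_links(SAMPLE_LIST))
    title, urls = parse_album(SAMPLE_ALBUM)
    print(safe_dirname(title), urls)
```

Keeping the parsing separate from the downloading makes two weak spots in the committed script easier to see: `re.findall(...)[0]` raises `IndexError` when a page has no `<title>`, and a title containing characters such as `:` or `/` may not be usable as a directory name on every platform.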
