
Commit e468b41: update file

1 parent bec3920

21 files changed: +641 −375 lines

ECUT_pos_html.py

Lines changed: 0 additions & 57 deletions
This file was deleted.

README.md

Lines changed: 71 additions & 11 deletions
@@ -1,4 +1,4 @@
-# PythonCrawler: a collection of simple crawler projects written in Python
+# PythonCrawler: a collection of crawler projects written in Python
 ```
 (
 )\ ) ) ) ( (
@@ -9,30 +9,90 @@
 | _/| || || _|| ' \ / _ \| ' \))| (__ | '_|/ _` |\ V V /| |/ -_) | '_|
 |_| \_, | \__||_||_|\___/|_||_| \___||_| \__,_| \_/\_/ |_|\___| |_|
 |__/
-by yanghangfeng
+——————by yanghangfeng
 ```


-# Module overview
+# spiderFile module overview

-##### 1. [baiduImg.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/baiduImg.py): scrape images from Baidu's "高清摄影" (HD photography) category
+##### 1. [baidu_sy_img.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/baiduImg.py): scrape images from Baidu's "高清摄影" (HD photography) category

-##### 2. [baiduImg2.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/baiduImg2.py): scrape Baidu Images' "唯美意境" (aesthetic mood) section
+##### 2. [baidu_wm_img.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/baiduImg2.py): scrape Baidu Images' "唯美意境" (aesthetic mood) section

-##### 3. [GetPhotos2.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/GetPhotos2.py): scrape all images under a given Baidu Tieba topic
+##### 3. [get_photos.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/GetPhotos2.py): scrape all images under a given Baidu Tieba topic

-##### 4. [getWebAllImg.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/getWebAllImg.py): scrape all the images on an entire site
+##### 4. [get_web_all_img.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/getWebAllImg.py): scrape all the images on an entire site

-##### 5. [lagouPositionSpider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/lagouPositionSpider.py): enter any keyword and scrape every matching job posting in one step, saving the results to a local file
+##### 5. [lagou_position_spider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/lagouPositionSpider.py): enter any keyword and scrape every matching job posting in one step, saving the results to a local file

 ##### 6. [student_img.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/student_img.py): exploit a URL flaw in the school's official site to fetch the registration ID photos of all enrolled students

-##### 7. [JDSpider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/JDSpider.py): bulk-scrape JD.com product IDs and tags
+##### 7. [JD_spider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/JDSpider.py): bulk-scrape JD.com product IDs and tags

 ##### 8. [ECUT_pos_html.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/ECUT_pos_html.py): scrape all campus-recruiting notices from the school's official site and save them as HTML, with the images embedded in the HTML

 ##### 9. [ECUT_get_grade.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/ECUT_get_grade.py): log in to the school's official site, scrape the grades, and compute the grade-point average

-##### 10. [githubHot.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/githubHot.py): scrape the projects behind GitHub's trending languages, saving each project's description and homepage URL to a local file
+##### 10. [github_hot.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/githubHot.py): scrape the projects behind GitHub's trending languages, saving each project's description and homepage URL to a local file

-##### 11. [pictureSpider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/pictureSpider.py): at the request of a Zhihu user, scrape all the portrait photos on a certain site
+##### 11. [xz_picture_spider.py](https://github.com/Fenghuapiao/PythonCrawler/blob/master/pictureSpider.py): at the request of a Zhihu user, scrape all the portrait photos on a certain site
+---
+# spiderAPI module overview
+#### This module provides crawler-style API wrappers for a few sites. Coverage is not complete yet, which leaves plenty of room to build on; if you are interested, feel free to keep improving it.
+##### 1. Dianping
+```python
+from spiderAPI.dianping import *
+
+'''
+citys = {
+    '北京': '2', '上海': '1', '广州': '4', '深圳': '7', '成都': '8', '重庆': '9', '杭州': '3', '南京': '5', '沈阳': '18', '苏州': '6', '天津': '10', '武汉': '16', '西安': '17', '长沙': '344', '大连': '19', '济南': '22', '宁波': '11', '青岛': '21', '无锡': '13', '厦门': '15', '郑州': '160'
+}
+
+ranktype = {
+    '最佳餐厅': 'score', '人气餐厅': 'popscore', '口味最佳': 'score1', '环境最佳': 'score2', '服务最佳': 'score3'
+}
+'''
+
+result = bestRestaurant(cityId=1, rankType='popscore')  # fetch the most popular restaurants
+
+shoplist = dpindex(cityId=1, page=1)  # merchant leaderboard
+
+restaurantlist = restaurantList('http://www.dianping.com/search/category/2/10/p2')  # fetch restaurant listings
+
+```
+
+##### 2. Fetching usable proxy IPs
+Crawls http://proxy.ipcn.org and collects the proxies that actually work.
+```python
+from spiderAPI.proxyip import get_enableips
+
+enableips = get_enableips()
+
+```
+
+##### 3. Baidu Map
+The API Baidu Map provides officially puts some limits on queries, so this uses the query endpoint the web map itself calls.
+```python
+from spiderAPI.baidumap import *
+
+citys = citys()  # fetch the city list
+result = search(keyword="美食", citycode="257", page=1)  # fetch search results
+
+```
+
+##### 4. Simulated GitHub login
+```python
+from spiderAPI.github import GitHub
+
+github = GitHub()
+github.login()  # this step prompts for your username and password
+github.show_timeline()  # fetch the timeline of your GitHub home page
+# more features are left for you to discover
+```
+
+##### 5. Lagou
+```python
+from spiderAPI.lagou import *
+
+lagou_spider(key='数据挖掘', page=1)  # fetch job postings for the keyword '数据挖掘' (data mining)
+```
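The proxy list from section 2 plugs straight into `requests`; here is a minimal usage sketch, assuming `get_enableips()` returns entries shaped like `'ip:port'` strings (the exact shape depends on spiderAPI/proxyip.py):

```python
import random

import requests

from spiderAPI.proxyip import get_enableips

# Assumption: each entry looks like '1.2.3.4:8080'.
proxy = random.choice(get_enableips())
resp = requests.get('http://httpbin.org/ip',
                    proxies={'http': 'http://' + proxy},
                    timeout=10)
print(resp.text)  # the echoed origin IP should be the proxy's
```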

baiduImg.py

Lines changed: 0 additions & 53 deletions
This file was deleted.

baiduImg2.py

Lines changed: 0 additions & 48 deletions
This file was deleted.

spiderAPI/__init__.py

Whitespace-only changes.

spiderAPI/baidumap.py

Lines changed: 30 additions & 0 deletions
```python
import requests
import json

headers = {
    'Host': "map.baidu.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0"}


def citys():
    # Query the web map's own search endpoint; its JSON response carries the city list.
    html = requests.get(
        'http://map.baidu.com/?newmap=1&reqflag=pcmap&biz=1&from=webmap&da_par=baidu&pcevaname=pc4.1&qt=s&da_src=searchBox.button&wd=美食&c=1&src=0&wd2=&sug=0&l=5&b=(7002451.220000001,1994587.88;19470675.22,7343963.88)&from=webmap&biz_forward={%22scaler%22:1,%22styles%22:%22pl%22}&sug_forward=&tn=B_NORMAL_MAP&nn=0&u_loc=12736591.152491,3547888.166124&ie=utf-8&t=1459951988807', headers=headers).text
    data = json.loads(html)
    result = []
    # Cities are split between 'more_city' (grouped) and 'content' (top-level).
    for item in data['more_city']:
        for city in item['city']:
            result.append(city)
    for item in data['content']:
        result.append(item)
    return result


def search(keyword, citycode, page):
    # 'pn' is the page number and 'nn' the result offset (10 hits per page).
    html = requests.get('http://map.baidu.com/?newmap=1&reqflag=pcmap&biz=1&from=webmap&da_par=baidu&pcevaname=pc4.1&qt=con&from=webmap&c=' + str(citycode) + '&wd=' + keyword + '&wd2=&pn=' + str(
        page) + '&nn=' + str(page * 10) + '&db=0&sug=0&addr=0&&da_src=pcmappg.poi.page&on_gel=1&src=7&gr=3&l=12&tn=B_NORMAL_MAP&u_loc=12736591.152491,3547888.166124&ie=utf-8', headers=headers).text
    data = json.loads(html)['content']
    return data
```
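Because `search()` derives both the `pn` and `nn` parameters from its `page` argument, walking several result pages is a plain loop. A minimal sketch; the keyword and citycode values are only illustrative, and it assumes the endpoint's `content` field is a list of POI dicts:

```python
from spiderAPI.baidumap import search

# Gather the first three pages of results for one keyword/city.
pois = []
for page in range(1, 4):
    pois.extend(search(keyword='美食', citycode='257', page=page))

print('%d POIs fetched' % len(pois))
```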

spiderAPI/dianping.py

Lines changed: 87 additions & 0 deletions
```python
import requests
import json
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive'}


def bestRestaurant(cityId=1, rankType='popscore'):
    # Dianping's shop-rank AJAX endpoint returns the ranking directly as JSON.
    html = requests.get('http://www.dianping.com/mylist/ajax/shoprank?cityId=%s&shopType=10&rankType=%s&categoryId=0' %
                        (cityId, rankType), headers=headers).text
    result = json.loads(html)['shopBeans']
    return result


def getCityId():
    # City name -> cityId codes accepted by the endpoints above.
    citys = {'北京': '2', '上海': '1', '广州': '4', '深圳': '7', '成都': '8', '重庆': '9', '杭州': '3', '南京': '5', '沈阳': '18', '苏州': '6', '天津': '10',
             '武汉': '16', '西安': '17', '长沙': '344', '大连': '19', '济南': '22', '宁波': '11', '青岛': '21', '无锡': '13', '厦门': '15', '郑州': '160'}
    return citys


def getRankType():
    # Human-readable ranking name -> rankType parameter.
    RankType = {'最佳餐厅': 'score', '人气餐厅': 'popscore',
                '口味最佳': 'score1', '环境最佳': 'score2', '服务最佳': 'score3'}
    return RankType


def dpindex(cityId=1, page=1):
    # Scrape one page of the dpindex merchant leaderboard.
    url = 'http://dpindex.dianping.com/dpindex?region=&category=&type=rank&city=%s&p=%s' % (
        cityId, page)
    html = requests.get(url, headers=headers).text
    table = BeautifulSoup(html, 'lxml').find(
        'div', attrs={'class': 'idxmain-subcontainer'}).find_all('li')
    result = []
    for item in table:
        shop = {}
        shop['name'] = item.find('div', attrs={'class': 'field-name'}).get_text()
        shop['url'] = item.find('a').get('href')
        shop['num'] = item.find('div', attrs={'class': 'field-num'}).get_text()
        shop['addr'] = item.find('div', attrs={'class': 'field-addr'}).get_text()
        shop['index'] = item.find('div', attrs={'class': 'field-index'}).get_text()
        result.append(shop)
    return result


def restaurantList(url):
    # Parse a Dianping search-results page into a list of shop dicts.
    html = requests.get(url, headers=headers, timeout=30).text.replace('\r', '').replace('\n', '')
    table = BeautifulSoup(html, 'lxml').find('div', id='shop-all-list').find_all('li')
    result = []
    for item in table:
        shop = {}
        soup = item.find('div', attrs={'class': 'txt'})
        tit = soup.find('div', attrs={'class': 'tit'})
        comment = soup.find('div', attrs={'class': 'comment'})
        tag_addr = soup.find('div', attrs={'class': 'tag-addr'})
        shop['name'] = tit.find('a').get_text()
        shop['star'] = comment.find('span').get('title')
        # Strip the '条点评' ("reviews") suffix, keeping only the count.
        shop['review-num'] = comment.find('a',
                                          attrs={'class': 'review-num'}).get_text().replace('条点评', '')
        shop['mean-price'] = comment.find('a', attrs={'class': 'mean-price'}).get_text()
        shop['type'] = tag_addr.find('span', attrs={'class': 'tag'}).get_text()
        shop['addr'] = tag_addr.find('span', attrs={'class': 'addr'}).get_text()
        # Per-aspect scores (taste/environment/service) are optional on a listing.
        comment_span = soup.find('span', attrs={'class': 'comment-list'})
        if comment_span is None:
            comment_list = []
        else:
            comment_list = comment_span.find_all('span')
        score = []
        for i in comment_list:
            score.append(i.get_text())
        shop['score'] = score
        # Promo icons are also optional; 'class' comes back as a list of names.
        tags = []
        promo = tit.find('div', attrs={'class': 'promo-icon'})
        if promo is not None:
            for i in promo.find_all('a'):
                tags += i.get('class') or []
        shop['tags'] = tags
        result.append(shop)
    return result
```
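Tying the helpers together: a minimal sketch that looks up the city and rank-type codes by their human-readable names before querying, and pages through search listings by rewriting the trailing /pN segment of the URL. The URL pattern is the one from the README example; the city and rank names are only illustrative:

```python
from spiderAPI.dianping import (bestRestaurant, dpindex, getCityId,
                                getRankType, restaurantList)

# Resolve the numeric codes from their human-readable names.
city = getCityId()['上海']
rank = getRankType()['人气餐厅']

top = bestRestaurant(cityId=city, rankType=rank)   # popularity ranking
board = dpindex(cityId=city, page=1)               # merchant leaderboard

# Page through search listings by changing the trailing /pN segment.
shops = []
for page in range(1, 3):
    shops.extend(restaurantList(
        'http://www.dianping.com/search/category/2/10/p%d' % page))
```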
