A simple Python 3 crawler with requests, plus word segmentation and a word cloud - 老康的学习空间

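I am studying a lot of scattered topics at the moment, and much of what I need to learn is something I have already written before and since forgotten. Looking back, most of that old code was never saved, which is a pity. It makes later lookup and review harder, and a lot of history is lost along with it. So from now on I will try to keep my code snippets and write them up on this blog.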

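I wrote this one out of boredom a few days ago, because I did not feel like writing algorithms. It crawls every answer under the Zhihu question https://www.zhihu.com/question/27964933, segments the text into words, and builds a word cloud from the word frequencies. As for why this question in particular: forgive the resentment of someone who has been single for 20 years.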

The resulting word cloud looks like this:
[Image: the resulting word cloud]

It is not quite what I expected, and nothing stands out strongly, but it still looks pretty cool.

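The code has two parts. The first part crawls all the answers under this question. Zhihu is fairly convenient to crawl because its API responses are basically all JSON.
The code is as follows: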

import requests
import pymysql
from time import sleep
from random import gauss
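# host, username, password and databaseName are placeholders: define them before running.
# Keyword arguments avoid relying on positional order, which newer PyMySQL versions may not accept.
conn = pymysql.connect(host=host, user=username, password=password, database=databaseName, use_unicode=True, charset='utf8')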
sess = requests.Session()
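mycookies = {}
# zhihu.conf holds the raw request headers copied from the browser, one "Name: value" per line.
with open('zhihu.conf', 'r', encoding='utf-8') as f:
    for line in f:
        if ':' not in line:
            continue  # skip blank lines and anything that is not a header
        k, v = line.split(':', 1)
        if k.strip().lower() != 'cookie':
            continue  # only the Cookie header is needed
        # The Cookie header looks like "a=1; b=2; ..."
        for cookie in v.split(';'):
            if '=' not in cookie:
                continue
            ck, cv = cookie.split('=', 1)
            mycookies[ck.strip()] = cv.strip()
# Load the parsed cookies into the session so every request sends them.
requests.utils.add_dict_to_cookiejar(sess.cookies, mycookies)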

myheaders = {
    'Host': 'www.zhihu.com',
    'Referer': 'https://www.zhihu.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0',
}
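step = 20      # answers per request; matches limit=20 in the URL below
offset = 3677  # starting offset, apparently left over from resuming the interrupted run; use 0 to start from the first answer
cur = conn.cursor()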
while True:
    r = sess.get("https://www.zhihu.com/api/v4/questions/27964933/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp,upvoted_followees;data[*].mark_infos[*].url;data[*].author.follower_count,badge[?(type=best_answerer)].topics&offset=%d&limit=20&sort_by=default" % offset,headers = myheaders)
    dataset = r.json().get('data')
    pointer = 0
    for data in dataset:
        name = data.get('author').get('name')
        url_token = data.get('author').get('url_token')
        voteup_count = data.get('voteup_count')
        comment_count = data.get('comment_count')
        content = data.get('content')
        print('offset',offset+pointer+1)
        print('name',name)
        print('url_token',url_token)
        print('voteup_count',voteup_count)
        print('comment_count',comment_count)
        print('content',content)
        print('--------------------------')
        pointer += 1
        # Let the driver quote the values: formatting them in with %r breaks on quotes in the text and allows SQL injection.
        query = "insert into ex2 (name,url_token,voteup_count,comment_count,content) values (%s,%s,%s,%s,%s)"
        try:
            cur.execute(query, (name, url_token, voteup_count, comment_count, content))
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()
            conn.close()
            raise
    if len(dataset)!=step:
        print('length is not equal to the step : len(dataset) = '+str(len(dataset)))
        break
    offset += step
    sleep(5+gauss(3,1))

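The code is not very polished, because I kept changing it while debugging and never cleaned it up once it worked (laziness).
To explain the code: it reads a file called zhihu.conf, which looks like this: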

[Image: screenshot of the contents of zhihu.conf]
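Why do it this way? The text was copied straight from the F12 developer tools in Firefox, and I could not be bothered to add the cookies by hand, so I wrote a few lines to parse the cookies and add them to the session.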
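As a rough sketch of that parsing step in isolation, the function below pulls the cookies out of a block of raw header text. The function name and the sample headers are made up for illustration; the real zhihu.conf is assumed to hold "Name: value" lines as copied from the browser.

```python
def parse_cookie_header(raw_headers):
    """Return the cookies found in 'Name: value' header lines as a dict."""
    cookies = {}
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(':')
        if not sep or name.strip().lower() != 'cookie':
            continue  # skip blank lines, the request line, and other headers
        for pair in value.split(';'):
            key, eq, val = pair.partition('=')
            if eq:  # ignore fragments that have no '='
                cookies[key.strip()] = val.strip()
    return cookies


# Hypothetical sample; the real file holds the headers copied from the browser.
sample = """Host: www.zhihu.com
User-Agent: Mozilla/5.0
Cookie: session_id=abc123; theme=dark; token=a=b
Connection: keep-alive
"""
cookies = parse_cookie_header(sample)
print(cookies)
```

Splitting only on the first ':' and the first '=' keeps values that themselves contain those characters intact, which is why partition is used here.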
The program is not hard, but there were a few pitfalls, which I am noting here for future reference:

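When connecting with pymysql, pass use_unicode=True and charset='utf8', otherwise you get encoding errors.
Be careful with the request headers. At first I passed the contents of zhihu.conf to headers unchanged, and that did not work. It turned out that the request must not carry too many headers: Host, Referer and User-Agent are enough. I am not sure why and did not dig into it; I suspect Connection: keep-alive may be the cause.
The gauss in the sleep time was just for fun; I did not test whether it is needed.
The crawl broke off once in the middle and I resumed it by hand. I do not know the cause and did not investigate.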
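Since the run had to be resumed by hand, here is one possible way to structure the paging loop so that the current offset is always known. This is only a sketch: crawl_pages, polite_pause and fake_fetch are hypothetical names, and the demonstration uses fake data rather than real HTTP requests.

```python
import random
import time


def crawl_pages(fetch_page, start_offset=0, step=20, pause=None):
    """Yield (offset, items) for each page until a short page signals the end.

    fetch_page(offset, limit) is a hypothetical callable that returns a list.
    """
    offset = start_offset
    while True:
        items = fetch_page(offset, step)
        yield offset, items
        if len(items) < step:
            break  # a short page means there is nothing left to fetch
        offset += step
        if pause is not None:
            pause()


def polite_pause():
    """Sleep about 8 seconds with Gaussian jitter, never less than 1 second."""
    time.sleep(max(1.0, 5 + random.gauss(3, 1)))


# Offline demonstration with fake data instead of real HTTP requests.
fake_answers = list(range(45))


def fake_fetch(offset, limit):
    return fake_answers[offset:offset + limit]


seen = []
last_offset = None
for offset, items in crawl_pages(fake_fetch, start_offset=20, step=20):
    seen.extend(items)
    last_offset = offset  # persist this value to resume after a crash
```

In a real run, polite_pause would be passed as the pause argument, and writing last_offset to a file after each page would make it possible to restart from that offset instead of guessing.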
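In the end this crawled a little over 6,000 answers; a quick look at the database shows 6,011. Next comes the second part: pulling all the answers out of the database, segmenting them into words, and making the word cloud. It uses three modules: beautifulsoup to parse the answer content (which contains a lot of HTML tags), jieba for word segmentation, and wordcloud to draw the word cloud. The code is as follows: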

from bs4 import BeautifulSoup
import json
import jieba
import jieba.posseg as posseg
from collections import Counter
from wordcloud import WordCloud
from PIL import Image
# Only the last line of the exported file is parsed as JSON.
with open('ex2.json','r',encoding='utf-8') as f:
    datas = f.readlines()
data = datas[-1]
data = json.loads(data)
r = ''
for item in data:
    content = item.get('content')
    soup = BeautifulSoup(content,'lxml')
    for i in soup.strings:
        i = i.strip()
        r += i
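# str.replace returns a new string, so the result must be assigned; doing it once after the loop is enough.
r = r.replace('著作权归作者所有，禁止转载。','')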
print('the length of r is: ',len(r))
words = [w for w,f in posseg.cut(r) if f[0]!='r' and len(w)>1]
c = Counter(words)
print(c.most_common(20))
wc = WordCloud(font_path='c:\\Windows\\Fonts\\simkai.ttf',height=1080,width=1920).generate_from_frequencies(c)
image = wc.to_image()
image.show()
wc.to_file("ex2.png")

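It uses a file called ex2.json, which I downloaded directly from phpMyAdmin. Its format looks like this:
[Image: screenshot of the ex2.json format]
Zhihu automatically appends the sentence 著作权归作者所有，禁止转载 (copyright belongs to the author, reproduction prohibited) to almost every answer, so I remove it to keep it from skewing the frequency counts.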
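To illustrate the cleaning step on a single answer, here is a small sketch that strips the tags and removes the notice. It deliberately uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages; TextExtractor, clean_answer and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser

NOTICE = '著作权归作者所有，禁止转载。'


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data.strip())


def clean_answer(html):
    """Return the visible text of one answer without the copyright notice."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = ''.join(parser.parts)
    # str.replace returns a new string, so the result has to be kept.
    return text.replace(NOTICE, '')


# Hypothetical answer body, not taken from the real data.
sample = '<p>第一段</p><p>第二段<br>著作权归作者所有，禁止转载。</p>'
cleaned = clean_answer(sample)
print(cleaned)
```

The important detail is the last line of clean_answer: Python strings are immutable, so calling replace without using its return value changes nothing.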
Notes:

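WordCloud does not support Chinese by default. To render Chinese you must pass the font_path parameter and point it at a Chinese font; otherwise the Chinese characters show up as empty boxes.
I used jieba.posseg.cut rather than plain jieba.cut, because it tags each word with its part of speech, which makes it possible to drop unwanted words. My first attempt had no part-of-speech tagging and produced a lot of pronouns, such as “我们” (we), so I excluded pronouns.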
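The pronoun filter can be shown on its own with a few hand-tagged words. This sketch does not call jieba: the tagged list below is a made-up stand-in for the (word, flag) pairs that the script unpacks from posseg.cut, and count_words is a hypothetical helper name.

```python
from collections import Counter


def count_words(tagged_words, excluded_prefixes=('r',), min_length=2):
    """Count words, skipping excluded parts of speech and short words."""
    counts = Counter()
    for word, flag in tagged_words:
        if flag.startswith(excluded_prefixes) or len(word) < min_length:
            continue
        counts[word] += 1
    return counts


# Hand-tagged stand-in for the (word, flag) pairs produced by posseg.cut.
tagged = [('我们', 'r'), ('喜欢', 'v'), ('电影', 'n'), ('的', 'uj'),
          ('喜欢', 'v'), ('这', 'r'), ('音乐', 'n')]
counts = count_words(tagged)
print(counts.most_common(1))
```

Making the excluded prefixes a parameter means other categories could be dropped the same way if they turn out to dominate the cloud.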

More than 1.89 million characters, boiled down into a single image. I can't help sighing: no wonder I can't find a girlfriend ╮(๑•́ ₃•̀๑)╭

[Image: the final word cloud]

LeoKing
