Scraping Weibo Comments with Python (No Duplicate Data)

2023-10-08


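Preface
I. Overall approach
II. Getting the Weibo post URLs
1. Getting the AJAX URLs
2. Parsing the Weibo post URLs from a page
3. Getting a given user's Weibo post URLs
III. Getting the main comments
IV. Getting the child comments
1. Parsing the child comments
2. Getting the child comments
V. Calling the main function
1. Importing the libraries
2. Running the main function
3. Results
Closing remarks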


Tip: This article is for learning and discussion only. Do not use it for any illegal purpose!

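Preface

A while ago, the comments on Weibo about a certain diary became severely polarized. Out of curiosity, I wanted to do a simple analysis of those comments and the users behind them, so I found some code online, changed the cookies and a few other parameters, and ran it.
It ran without a single error! I was stunned; things had never gone this smoothly before.

Then I started the analysis, and before long I realized it was not that simple: the data was duplicated!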


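Yes, the duplication rate was outrageously high. So I set out to investigate.
Note: the overall code still borrows from an earlier author's work; my main contribution is fixing the duplicate-data problem. If this infringes on anyone's rights, please contact me.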

I. Overall approach

The approach is fairly clear. I drew an extremely simple flowchart:

[Figure: flowchart of the overall approach]

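As for why the main comments and the child comments are fetched separately: this is the key to solving the duplication problem. Testing shows that if you simply increment the page parameter following the obvious pattern, or scrape the .cn pages instead, then after a certain number of comments (roughly a few hundred) you either stop getting data or start getting duplicates.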
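A quick way to confirm the duplication problem described above is to count how many scraped comments share an id. The following is only a minimal sketch: `duplicate_rate` is a hypothetical helper that is not part of the original code, and it assumes each comment is a dict with an "id" key, as produced by the parser later in this post.

```python
from collections import Counter


def duplicate_rate(comments):
    """Return the fraction of comments whose id was already seen earlier.

    Hypothetical helper: assumes every comment is a dict with an "id" key.
    """
    if not comments:
        return 0.0
    counts = Counter(c["id"] for c in comments)
    # Every occurrence beyond the first one of an id counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(comments)


# Made-up sample: the id "1" appears three times.
sample = [{"id": "1"}, {"id": "2"}, {"id": "1"}, {"id": "1"}]
print(duplicate_rate(sample))
```

Running a check like this on the output of a naive page-by-page loop is one way to see the problem before trying to fix it.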

II. Getting the Weibo post URLs

The request URL when visiting a Weibo user's page is:

https://weibo.com/xxxxxxxxx?is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page=1

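Changing the page parameter controls the page number. Attentive readers will have noticed, however, that besides the HTML loaded directly, one page of data also includes two further loads fetched dynamically via AJAX: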

start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)
start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)

In other words, each page of data is made up of three parts: the directly loaded HTML and the two AJAX responses.
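To make the three parts concrete, here is a rough sketch that builds all three request URLs for one page. `page_request_urls` and `AJAX_TEMPLATE` are names I made up for illustration, and the domain and page_id values in the usage lines are placeholders; the real values come from the profile page itself.

```python
# Same AJAX URL as above, with named placeholders instead of %s and {0}.
AJAX_TEMPLATE = (
    "https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain={domain}"
    "&is_all=1&page={page}&pagebar={pagebar}"
    "&pl_name=Pl_Official_MyProfileFeed__20&id={page_id}"
    "&script_uri=/{user}&pre_page={page}"
)


def page_request_urls(user, domain, page_id, page):
    """Build the three URLs that together make up one profile page."""
    html_url = "https://weibo.com/{}?page={}&is_all=1".format(user, page)
    ajax_urls = [
        AJAX_TEMPLATE.format(domain=domain, page_id=page_id, user=user,
                             page=page, pagebar=pagebar)
        for pagebar in (0, 1)  # the two dynamic loads differ only in pagebar
    ]
    return [html_url] + ajax_urls


# Placeholder values, for illustration only.
urls = page_request_urls("xxxxxxxxx", "100505", "1005050000000000", 1)
for u in urls:
    print(u)
```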

1. Getting the AJAX URLs

From the main page, obtain the corresponding AJAX request URLs:

def get_ajax_url(user):
    # Fetch the profile page and read page_id and domain from its inline config
    url = 'https://weibo.com/%s?page=1&is_all=1' % user
    res = requests.get(url, headers=headers, cookies=cookies)
    html = res.text
    page_id = re.findall(r"CONFIG\['page_id'\]='(.*?)'", html)[0]
    domain = re.findall(r"CONFIG\['domain'\]='(.*?)'", html)[0]
    # {0} is left unfilled here; it is replaced by the page number later
    start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}' % (domain, page_id, user)
    start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}' % (domain, page_id, user)
    return start_ajax_url1, start_ajax_url2
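The regular expressions in get_ajax_url can be tried offline before spending any requests. The sketch below runs the same pattern against a made-up snippet that imitates the page's inline config; `extract_config` is a hypothetical helper, and unlike indexing with [0], it returns None instead of raising when the pattern is absent.

```python
import re

# Made-up snippet imitating the inline config of a profile page.
sample_html = """
<script>
$CONFIG['page_id']='1005050000000000';
$CONFIG['domain']='100505';
</script>
"""


def extract_config(html, key):
    """Return the value of CONFIG[key] from the page source, or None."""
    match = re.search(r"CONFIG\['%s'\]='(.*?)'" % re.escape(key), html)
    return match.group(1) if match else None


print(extract_config(sample_html, "page_id"))
print(extract_config(sample_html, "domain"))
print(extract_config(sample_html, "missing"))
```

Returning None makes it easier to report a clear error when the page does not contain the expected config, rather than failing with an IndexError.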

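2. Parsing the Weibo post URLs from a page

After sending the request, parse the Weibo post URLs out of the response (the same code handles the main page request and the AJAX requests):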

def parse_home_url(url):
    res = requests.get(url, headers=headers, cookies=cookies)
    # Drop the backslashes from the escaped HTML before matching
    response = res.content.decode().replace("\\", "")
    every_id = re.compile(r'name=(\d+)', re.S).findall(response)  # ids needed for the comment pages
    home_url = []
    for weibo_id in every_id:
        base_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={}&from=singleWeiBo'
        url = base_url.format(weibo_id)
        home_url.append(url)
    return home_url
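The id extraction step can also be checked on its own. This is a small sketch on a made-up fragment written in the style of an escaped AJAX response; `extract_weibo_ids` is a hypothetical helper. Note that the de-duplication with dict.fromkeys is my own addition for illustration, and the function above does not do it.

```python
import re

# Made-up fragment in the style of an escaped AJAX response body.
sample = (
    r'<a name=4498052401861557 href=\"\/x\">..'
    r'<a name=4498052401861557>..<a name=4498000000000001>'
)


def extract_weibo_ids(text):
    """Strip backslashes, then collect the numeric ids in first-seen order."""
    cleaned = text.replace("\\", "")
    ids = re.findall(r"name=(\d+)", cleaned)
    # dict.fromkeys keeps insertion order and drops repeated ids.
    return list(dict.fromkeys(ids))


print(extract_weibo_ids(sample))
```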

3. Getting a given user's Weibo post URLs

Combining the two functions above gives:

def get_home_url(user, page):
    start_url = 'https://weibo.com/%s?page={}&is_all=1' % user
    start_ajax_url1, start_ajax_url2 = get_ajax_url(user)
    all_url = []  # accumulated across pages so earlier pages are not overwritten
    for i in range(page):
        home_url = parse_home_url(start_url.format(i + 1))         # posts in the directly loaded HTML
        ajax_url1 = parse_home_url(start_ajax_url1.format(i + 1))  # posts from the first AJAX load
        ajax_url2 = parse_home_url(start_ajax_url2.format(i + 1))  # posts from the second AJAX load
        all_url.extend(home_url + ajax_url1 + ajax_url2)
        print('Page %d parsed' % (i + 1))
    return all_url

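The arguments are the user's ID and the number of pages to scrape; the return value is a list with one URL per Weibo post (in the form of that post's comment interface URL).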

III. Getting the main comments

A quick look at the request data shows that the interface for fetching Weibo comments is:

https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4498052401861557&root_comment_max_id=185022621492535&root_comment_max_id_type=0&root_comment_ext_param=&page=1&from=singleWeiBo
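To see exactly which parameters this URL carries, the query string can be split apart with the standard library. This is just an inspection aid, not part of the scraper itself:

```python
from urllib.parse import urlsplit, parse_qs

url = (
    "https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4498052401861557"
    "&root_comment_max_id=185022621492535&root_comment_max_id_type=0"
    "&root_comment_ext_param=&page=1&from=singleWeiBo"
)

# keep_blank_values=True keeps empty parameters such as root_comment_ext_param.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
for key, values in params.items():
    print(key, "=", values[0])
```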

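A very conspicuous page parameter catches the eye, and the request even seems to work with the other parameters removed. Your first reaction may be to write a loop that just increments page. Well, do that and you end up right back at the dreaded starting point: duplicate data. The root_comment_max_id parameter appears to matter as well, so we need a way to obtain it. Further analysis shows that the data returned by each request already contains the address of the next request; we only need to extract it and keep following it.

The code is as follows: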

def parse_comment_info(data_json):
    html = etree.HTML(data_json['data']['html'])
    name = html.xpath("//div[@class='list_li S_line1 clearfix']/div[@class='WB_face W_fl']/a/img/@alt")
    info = html.xpath("//div[@node-type='replywrap']/div[@class='WB_text']/text()")
    info = "".join(info).replace(" ", "").split("\n")
    info.pop(0)
    comment_time = html.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
    name_url = html.xpath("//div[@class='WB_face W_fl']/a/@href")
    name_url = ["https:" + i for i in name_url]
    ids = html.xpath("//div[@node-type='root_comment']/@comment_id")
    # The response carries the parameters of the next request in an action-data attribute
    try:
        next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + html.xpath('/html/body/div/div/div[%d]/@action-data' % (len(name) + 1))[0] + '&__rnd=' + str(int(time.time() * 1000))
    except IndexError:
        try:
            next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + html.xpath('/html/body/div/div/a/@action-data')[0] + '&__rnd=' + str(int(time.time() * 1000))
        except IndexError:
            next_url = ''  # no further page
    comment_info_list = []
    for i in range(len(name)):
        item = {}
        item["id"] = ids[i]
        item["name"] = name[i]                  # commenter's screen name
        item["comment_info"] = info[i][1:]      # comment text
        item["comment_time"] = comment_time[i]  # comment time
        item["comment_url"] = name_url[i]       # commenter's profile page
        try:
            action_data = html.xpath("/html/body/div/div/div[%d]//a[@action-type='click_more_child_comment_big']/@action-data" % (i + 1))[0]
            child_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + action_data
            item["child_url"] = child_url
        except IndexError:
            item["child_url"] = ''  # this comment has no child comments
        comment_info_list.append(item)
    return comment_info_list, next_url

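The argument is the JSON data from the response; the function returns the parsed data together with the next URL. Each parsed comment is a dict with the keys id, name, comment_info, comment_time, comment_url and child_url.

Here child_url is the URL of the corresponding child comments, which is used to fetch them in the next step.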
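The "keep following the next URL" idea can be sketched independently of Weibo. In the sketch below, `collect_all` and `fetch_page` are hypothetical names: `fetch_page` stands for "request the URL and run the parser", returning a (comments, next_url) pair. The extra check against already seen ids is my own safeguard and is not in the original code.

```python
def collect_all(first_url, fetch_page):
    """Follow next_url links until none is left, skipping repeated ids.

    fetch_page is a hypothetical callable: url -> (comments, next_url).
    """
    seen_ids = set()
    collected = []
    url = first_url
    while url:  # an empty next_url means there are no more pages
        comments, url = fetch_page(url)
        for comment in comments:
            if comment["id"] not in seen_ids:
                seen_ids.add(comment["id"])
                collected.append(comment)
    return collected


# Fake pages standing in for real responses; "b" appears on both pages.
fake_pages = {
    "p1": ([{"id": "a"}, {"id": "b"}], "p2"),
    "p2": ([{"id": "b"}, {"id": "c"}], ""),
}
result = collect_all("p1", lambda u: fake_pages[u])
print([c["id"] for c in result])
```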

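IV. Getting the child comments

Fetching the child comments follows the same idea as fetching the main comments. Once all main comments have been collected, we iterate over the results, and whenever child_url is not empty (that is, the comment has child comments), we send a request to fetch them.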
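As a rough illustration of that traversal, the sketch below attaches a list of child comments to every main comment. `attach_children` and `fetch_children` are hypothetical names, and the stub used here replaces the real request so that the example runs offline.

```python
def attach_children(comments, fetch_children):
    """Add a "child" list to every main comment, in place.

    fetch_children is a hypothetical callable: child_url -> list of replies.
    """
    for comment in comments:
        child_url = comment.get("child_url", "")
        # Only comments with a non-empty child_url trigger a request.
        comment["child"] = fetch_children(child_url) if child_url else []
    return comments


main_comments = [
    {"id": "1", "child_url": "https://example.invalid/child?id=1"},
    {"id": "2", "child_url": ""},
]


def stub_fetch(url):
    """Stands in for a real request; always returns one made-up reply."""
    return [{"id": "1-1"}]


attach_children(main_comments, stub_fetch)
print([len(c["child"]) for c in main_comments])
```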

1. Parsing the child comments

def parse_comment_info_child(data_json):
    html = etree.HTML(data_json['data']['html'])
    name = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/a[1]/text()")
    info = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/text()")
    info = "".join(info).replace(" ", "").split("\n")
    info.pop(0)
    comment_time = html.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
    name_url = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/a[1]/@href")
    name_url = ["https:" + i for i in name_url]
    ids = html.xpath("//div[@class='list_li S_line1 clearfix']/@comment_id")
    try:
        next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + html.xpath('/html/body/div[%d]/div/a/@action-data' % (len(name) + 1))[0] + '&__rnd=' + str(int(time.time() * 1000))
    except IndexError:
        next_url = ''  # no further page of child comments
    comment_info_list = []
    for i in range(len(name)):
        item = {}
        item["id"] = ids[i]
        item["name"] = name[i]                  # commenter's screen name
        item["comment_info"] = info[i][1:]      # comment text
        item["comment_time"] = comment_time[i]  # comment time
        item["comment_url"] = name_url[i]       # commenter's profile page
        comment_info_list.append(item)
    return comment_info_list, next_url

2. Getting the child comments

Wrapping the previous function gives us the corresponding child comments:

def get_childcomment(url_child):
    print('Fetching child comments...')
    comment_info_list = []
    res = requests.get(url_child, headers=headers, cookies=cookies)
    data_json = res.json()
    count = data_json['data']['count']
    comment_info, next_url = parse_comment_info_child(data_json)
    comment_info_list.extend(comment_info)
    print('Fetched %d so far' % len(comment_info_list))
    # Keep following next_url until all child comments are collected
    while len(comment_info_list) < count:
        if next_url == '':
            break
        res = requests.get(next_url, headers=headers, cookies=cookies)
        data_json = res.json()
        comment_info, next_url = parse_comment_info_child(data_json)
        comment_info_list.extend(comment_info)
        print('Fetched %d so far' % len(comment_info_list))
    return comment_info_list

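The argument is child_url; the return value is the corresponding list of child comments.

V. Calling the main function

1. Importing the libraries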

import re
import time
import json
import urllib
import requests
from lxml import etree

2. Running the main function

if "__main__" == __name__:
    # Set the parameters
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Content-Type': 'application/x-www-form-urlencoded',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
    }
    cookies = {}  # Weibo cookies (take them from your own logged-in requests)
    userid = ''   # ID of the Weibo user to scrape
    page = 1      # number of pages to scrape
    # Start scraping
    all_urls = get_home_url(userid, page)
    for index in range(len(all_urls)):
        url = all_urls[index]
        print('Fetching main comments of post %d...' % (index + 1))
        comment_info_list = []
        res = requests.get(url, headers=headers, cookies=cookies)
        data_json = res.json()
        count = data_json['data']['count']
        comment_info, next_url = parse_comment_info(data_json)
        comment_info_list.extend(comment_info)
        print('Fetched %d so far' % len(comment_info_list))
        # Follow next_url until the response no longer provides one
        while True:
            if next_url == '':
                break
            res = requests.get(next_url, headers=headers, cookies=cookies)
            data_json = res.json()
            comment_info, next_url = parse_comment_info(data_json)
            comment_info_list.extend(comment_info)
            print('Fetched %d so far' % len(comment_info_list))
        # Fetch the child comments of every main comment that has some
        for i in range(len(comment_info_list)):
            child_url = comment_info_list[i]['child_url']
            if child_url != '':
                comment_info_list[i]['child'] = get_childcomment(child_url)
            else:
                comment_info_list[i]['child'] = []
        # One output file per Weibo post
        with open('第%d條微博評論.txt' % (index + 1), 'w') as f:
            f.write(json.dumps(comment_info_list))
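Since the goal stated in the preface was to analyze the comments, it may help to flatten the nested structure written above into rows. The following is only a sketch under assumptions: `flatten` is a hypothetical helper, the column choice is mine, and the sample data is made up to mirror the dict keys produced by the parsers.

```python
import csv
import io


def flatten(comments):
    """Yield one row per main comment and one per child comment."""
    for main in comments:
        yield {"id": main["id"], "parent_id": "", "name": main["name"],
               "comment_info": main["comment_info"]}
        for child in main.get("child", []):
            # parent_id links a child comment back to its main comment.
            yield {"id": child["id"], "parent_id": main["id"],
                   "name": child["name"],
                   "comment_info": child["comment_info"]}


# Made-up data in the same shape as comment_info_list.
sample = [{
    "id": "1", "name": "user_a", "comment_info": "first",
    "child": [{"id": "2", "name": "user_b", "comment_info": "reply"}],
}]

rows = list(flatten(sample))
buffer = io.StringIO()  # an in-memory file; a real file opened with newline='' works the same way
writer = csv.DictWriter(
    buffer, fieldnames=["id", "parent_id", "name", "comment_info"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```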