Crawler for scraping web page data (find which URLs a page contains, then repeat)

优采云 · Published: 2021-12-30 12:09

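Collecting every URL on a website is simple in principle: parse each newly fetched page for the URLs it contains, then repeat the same step on those URLs.

The example below crawls CSDN. First, a few helper functions: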

import os
import time

import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def getResponse(url):  # fetch a Response with requests
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    return response


def getHTMLBySelenium(url):  # fetch the rendered page source with selenium
    try:
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        # Selenium 4 style; older releases took executable_path directly in webdriver.Chrome
        service = Service(executable_path='C:/Users/Administrator/Desktop/chromedriver_win32/chromedriver.exe')
        browser = webdriver.Chrome(service=service, options=chrome_options)
        browser.get(url)
        time.sleep(2)  # give the page time to render
        page_text = browser.page_source
        browser.quit()
        return page_text
    except Exception as e:
        # any failure yields an empty page so the crawl can keep going
        return ''


def getBlog(url):  # save the text content of a page to a file
    try:
        page_text = getHTMLBySelenium(url)
        tree = etree.HTML(page_text)
        allText = tree.xpath('//body//text()')
        text = '\n'.join(allText)
        # derive a file name from the URL
        title = url.replace('/', '_')
        title = title.replace('.', '_')
        title = title.replace(':', '_')
        os.makedirs('全站', exist_ok=True)  # make sure the output folder exists
        with open('全站/' + title + '.txt', 'w', encoding='utf-8') as fp:
            fp.write(text)
    except Exception as e:
        return

Next, extract the URLs of all other pages that a page links to. The details depend on the site being crawled; this is how it is done for CSDN:

def getLinks(url):
    try:
        page_text = getHTMLBySelenium(url)
        tree = etree.HTML(page_text)
        all_href = tree.xpath('//a')
        links = []
        for href in all_href:
            link = href.xpath('./@href')
            if len(link) == 0:  # <a> tag without an href attribute
                continue
            link = link[0]
            # keep only links that stay on the CSDN blog site
            if link.startswith('https://blog.csdn.net'):
                links.append(link)
        return links
    except Exception as e:
        return []
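The function above keeps only `href` values that are already absolute and start with the CSDN prefix. On sites that use relative links, a normalization step may be needed first. The following is a minimal sketch under that assumption, not part of the original code: `normalize_links` and its parameters are hypothetical names, and it relies only on the standard library's `urllib.parse`.

```python
from urllib.parse import urljoin, urldefrag


def normalize_links(page_url, hrefs, prefix='https://blog.csdn.net'):
    """Resolve raw href values against page_url and keep those under prefix."""
    seen = set()
    result = []
    for href in hrefs:
        # turn relative links such as '/bar' into absolute URLs
        absolute = urljoin(page_url, href.strip())
        # drop '#fragment' so the same page is not queued twice
        absolute, _fragment = urldefrag(absolute)
        if not absolute.startswith(prefix):
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append(absolute)
    return result


page = 'https://blog.csdn.net/foo/article/details/1'
raw = ['/bar', 'https://blog.csdn.net/baz#comments',
       'https://www.csdn.net/', 'javascript:;', '/bar']
cleaned = normalize_links(page, raw)
```

The values returned by `href.xpath('./@href')` in `getLinks` could be passed through such a function before they are appended to `links`.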

Below is the process of collecting page URLs level by level. Start with a simplified version of the code:

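urls = set()   # URLs that have already been processed
temp1 = set()  # URLs currently being processed
temp2 = set()  # newly discovered URLs
temp1.add('url')  # the first page to analyse, e.g. the site's home page URL

while temp1:  # keep running as long as temp1 is not empty
    for url in temp1:
        if url in urls:  # url is in urls, so it has already been processed
            continue
        doSomeThing(url)  # process the url
        urls.add(url)     # mark it as processed so it is not visited again
        for link in getLinks(url):  # find the other URLs on the page behind url
            if link in urls:
                continue
            if link in temp2:
                continue
            temp2.add(link)
    # every url in temp1 has been processed:
    # move the contents of temp2 into temp1 and empty temp2
    temp1 = temp2.copy()
    temp2.clear()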
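To try this loop without any network access, the same three-set logic can be run against a small in-memory link graph. This is only an illustrative sketch: `crawl`, `GRAPH` and the `handle` callback are hypothetical names that stand in for the real `getLinks` and page-processing functions.

```python
def crawl(start, get_links, handle):
    """Level-by-level traversal using the same three sets as above."""
    urls = set()      # already processed
    temp1 = {start}   # current level
    temp2 = set()     # next level
    while temp1:
        for url in temp1:
            if url in urls:
                continue
            handle(url)
            urls.add(url)
            for link in get_links(url):
                if link not in urls:
                    temp2.add(link)  # a set ignores duplicates by itself
        # the current level is done: the next level becomes the current one
        temp1, temp2 = temp2, set()
    return urls


# Hypothetical link graph standing in for real pages.
GRAPH = {
    'a': ['b', 'c'],
    'b': ['a', 'd'],
    'c': ['d'],
    'd': [],
}

visited = []
result = crawl('a', lambda u: GRAPH.get(u, []), visited.append)
```

Each page is handled exactly once even though `a` and `d` are linked from several places, and the loop ends as soon as a level produces no new URLs.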

The code above shows the overall logic of the program, but a few issues come up in practice:

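First, where to keep the collected links. I started with sets and backed each of urls, temp1 and temp2 up to its own text file, because there is no telling at which point the program will fail. With the URLs saved as text, the crawl does not have to start again from scratch. This is also why the helper functions above are wrapped in try...except.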
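The text-file backup described here could look roughly like the sketch below. The original post does not show this part of the first version, so `save_set` and `load_set` are hypothetical helpers, and writing to a temporary file before swapping it in is an added assumption intended to avoid a half-written backup if the program dies mid-write.

```python
import os
import tempfile


def save_set(path, items):
    """Write one URL per line; write to a temp file first, then swap it in."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', encoding='utf-8') as fp:
        for item in sorted(items):
            fp.write(item + '\n')
    os.replace(tmp_path, path)  # replaces the old backup in a single step


def load_set(path):
    """Return the saved URLs, or an empty set if no backup exists yet."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as fp:
        return {line.strip() for line in fp if line.strip()}


# Round trip in a throwaway folder.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'urls.txt')
    save_set(path, {'https://blog.csdn.net/a', 'https://blog.csdn.net/b'})
    restored = load_set(path)
    missing = load_set(os.path.join(folder, 'nothing.txt'))
```

On restart, the three sets can be reloaded with `load_set` before entering the loop, so the crawl resumes from the last saved level.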

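Following this approach I finished the first version of the code, set + text file. After it had run for two days over a weekend, the computer's memory was full (Win10 with 16 GB of RAM) and the machine froze. After a forced shutdown and restart I checked the files holding the URLs: the program had reached about the fourth pass of the outer loop at most, and temp2 held several hundred thousand URLs.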

Since memory was the bottleneck, I decided to store the URLs in a database: MySQL replaces the sets, and the text files stay as a backup.

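Below is the code for this version. If the program runs for two days without memory problems, this article will not be updated again: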

# ---- use pymysql to work with the database
import os
import re

import pymysql

# host, port, db, user and password must be defined for your own MySQL server


def get_connection():
    conn = pymysql.connect(host=host, port=port, db=db, user=user, password=password)
    return conn


# open the database connection
conn = get_connection()
cnt = 1
loop = 2
cursor = conn.cursor()
cursor1 = conn.cursor()
cursor2 = conn.cursor()

while True:
    print(f'Loop {loop}')
    loop += 1
    # iterate over temp1
    cursor1.execute("select * from csdn_temp1")
    while True:
        temp1Res = cursor1.fetchone()
        if temp1Res is None:
            # the result set is exhausted: temp1 has been fully traversed
            break
        print(temp1Res)
        url = temp1Res[0]
        url = re.sub('[\u4e00-\u9fa5]', '', url)  # strip Chinese characters
        cursor.execute("select * from csdn_urls where url = %s", [url])
        urlsRes = cursor.fetchone()
        # this link has already been crawled
        if urlsRes is not None:
            continue
        cnt += 1
        sql = "insert ignore into csdn_urls values(%s)"
        cursor.execute(sql, (url,))
        conn.commit()
        with open('urls.txt', 'a', encoding='utf-8') as fp:
            fp.write(url)
            fp.write('\n')
        getBlog(url)
        links = getLinks(url)
        for link in links:
            # sanitize first so the checks and the insert see the same string
            link = re.sub('[\u4e00-\u9fa5]', '', link)
            # skip links that were already crawled or are already in temp2
            cursor.execute("select * from csdn_urls where url = %s", [link])
            urlsRes = cursor.fetchone()
            if urlsRes is not None:
                continue
            cursor2.execute("select * from csdn_temp2 where url = %s", [link])
            temp2Res = cursor2.fetchone()
            if temp2Res is not None:
                continue
            sql = "insert ignore into csdn_temp2 values(%s)"
            cursor2.execute(sql, (link,))
            conn.commit()
            with open('temp2.txt', 'a', encoding='utf-8') as fp:
                fp.write(link)
                fp.write('\n')
    conn.commit()
    # swap the two tables: temp2 becomes the new temp1, committing after each step
    cursor.execute("rename table csdn_temp1 to csdn_temp")
    conn.commit()
    cursor.execute("rename table csdn_temp2 to csdn_temp1")
    conn.commit()
    cursor.execute("rename table csdn_temp to csdn_temp2")
    conn.commit()
    # empty temp2 (it now holds the level that was just processed)
    cursor.execute("delete from csdn_temp2")
    conn.commit()
    # swap the backup files the same way and empty temp2.txt
    os.rename('temp1.txt', 'temp3.txt')
    os.rename('temp2.txt', 'temp1.txt')
    os.rename('temp3.txt', 'temp2.txt')
    with open('temp2.txt', 'w', encoding='utf-8') as fp:
        fp.write('')
    # stop once no new URLs were found, like `while temp1` in the set version
    cursor.execute("select count(*) from csdn_temp1")
    if cursor.fetchone()[0] == 0:
        break

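While writing the code above I hit one problem: I did not commit promptly after renaming the tables, and the several hundred thousand URLs collected by the first version were wiped out. The table was empty, and the text files used as backup had been emptied as well. The code above is the version after the fix.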
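Because the swap step is where the data was lost, it can help to rehearse it on a throwaway database first. The sketch below is not the author's method: it uses the standard library's `sqlite3` as a stand-in for MySQL so that it runs without a server, and instead of renaming tables it moves the rows inside a single transaction. The table names `urls`, `temp1` and `temp2` and both helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE urls  (url TEXT PRIMARY KEY)')
conn.execute('CREATE TABLE temp1 (url TEXT PRIMARY KEY)')
conn.execute('CREATE TABLE temp2 (url TEXT PRIMARY KEY)')


def add_to_next_level(conn, link):
    """Queue link unless it was already crawled; the PRIMARY KEY
    together with INSERT OR IGNORE handles duplicates in temp2."""
    crawled = conn.execute('SELECT 1 FROM urls WHERE url = ?', (link,)).fetchone()
    if crawled is None:
        conn.execute('INSERT OR IGNORE INTO temp2 VALUES (?)', (link,))


def advance_level(conn):
    """Move temp2 into temp1 in one transaction: all statements apply or none."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute('DELETE FROM temp1')
        conn.execute('INSERT INTO temp1 SELECT url FROM temp2')
        conn.execute('DELETE FROM temp2')


conn.execute("INSERT INTO urls VALUES ('https://blog.csdn.net/seen')")
for link in ['https://blog.csdn.net/seen',
             'https://blog.csdn.net/new',
             'https://blog.csdn.net/new']:
    add_to_next_level(conn, link)
advance_level(conn)

temp1_rows = [row[0] for row in conn.execute('SELECT url FROM temp1')]
temp2_count = conn.execute('SELECT COUNT(*) FROM temp2').fetchone()[0]
```

Whether moving rows or renaming tables is preferable on a large MySQL table is a separate question; the point of the sketch is only that the next level is never emptied unless it has been copied in the same transaction.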

The full debugging process and the reasoning behind the code can be found in the Jupyter notebook on my GitHub.
