Letting a Python Crawler Take a Stroll Around Apache


Article link: https://51meaning.cn/blog/?p=593   When republishing, please credit www.51meaning.cn

This post shares a small Python crawler that can pull down every Kafka 2.x.x release from the Apache website.

1. Why Crawl It
Downloading a given release directly from Apache can be quite slow, so I wrote a small crawler and let it chug away on its own...

2. Results
After many days of effort, every Kafka 2.x.x release and every Linux/Windows build of Maven 3.x.x was finally downloaded, archived to a cloud drive, and shared.
The archives can be obtained here:
Maven 3.x.x:https://download.csdn.net/download/u011378744/12346342
Kafka 2.x.x:https://download.csdn.net/download/u011378744/12413189

3. Code
Finally, here is the Python code that crawls Kafka (it is rough, so suggestions are welcome); with a few small tweaks it can also crawl Maven (see the sketch after the listing).
import urllib.request
import re
import os
from urllib.parse import unquote
# Fetch a URL and return its raw contents.
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# Extract the .tgz download links (binary and source tarballs) from the directory listing HTML.
def getUrl(html):
    reg = r'(?:href|HREF)="?((?:https://www\.apache\.org/dist/kafka/)?.+?\.[0-9]\.tgz)"'
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('utf-8'))
    return url_lst

# Download a single file and save it under its original name in the current directory.
def getFile(url):
    print('Downloading ' + url)
    u = urllib.request.urlopen(url)
    url = unquote(url, 'utf-8')
    file_name = url.split('/')[-1]
    f = open(file_name, 'wb')

    # Read and write in 1 KB chunks so large archives never sit in memory all at once.
    block_sz = 1024
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        f.write(buffer)
        f.flush()
    f.close()
    print("Successfully downloaded " + file_name)


# Directory that lists every Kafka release on the Apache download mirror.
root_url = 'https://downloads.apache.org/kafka/'
raw_url = 'https://downloads.apache.org/kafka/'

# Save everything into a local "kafka3" directory.
os.makedirs('kafka3', exist_ok=True)
os.chdir(os.path.join(os.getcwd(), 'kafka3'))

version = ["2.2.2", "2.3.0", "2.3.1", "2.4.0", "2.4.1", "2.5.0"]
for v in version:
    html = getHtml(raw_url + v)
    url_lst = getUrl(html)

    for url in url_lst:
        url = root_url + v + "/" + url
        print(url)
        # Keep only the binary .tgz files: skip source tarballs and any sort/query links.
        if "src" not in url and "?" not in url:
            getFile(url)
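
To make the "tweak it for Maven" remark concrete, here is a minimal sketch of that variant. It reuses getHtml and getFile from above; the maven-3 directory layout (https://downloads.apache.org/maven/maven-3/<version>/binaries/), the file-name pattern, and the version list are my assumptions and should be checked against the live mirror, since older releases move to archive.apache.org over time.

# Maven variant (sketch): assumes binaries sit under
# https://downloads.apache.org/maven/maven-3/<version>/binaries/
# and are named apache-maven-<version>-bin.tar.gz or apache-maven-<version>-bin.zip.
def getMavenUrls(html):
    reg = r'href="(apache-maven-[\d.]+-bin\.(?:tar\.gz|zip))"'
    return re.findall(reg, html.decode('utf-8'))

maven_root = 'https://downloads.apache.org/maven/maven-3/'
maven_versions = ["3.6.3", "3.8.8"]   # placeholder list -- fill in the versions you actually need
for v in maven_versions:
    listing = getHtml(maven_root + v + '/binaries/')
    for name in getMavenUrls(listing):
        getFile(maven_root + v + '/binaries/' + name)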
