# web-crawler

**Repository Path**: xunknown/web-crawler

## Basic Information

- **Project Name**: web-crawler
- **Description**: Web crawler for fetching Linux patches and similar content
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-12-07
- **Last Updated**: 2022-09-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

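# Overview  
## Purpose  
Crawls Linux patches from [https://lore.kernel.org](https://lore.kernel.org).  
1. Specify a mailing list (inbox), for example linux-arm-kernel.  
2. Specify a query and run the search, for example 'd:3.days.ago..'.  
Based on the results of that search (the matching patch emails use the default sort order, newest first), the script crawls the Linux patch emails (threads), finds the root message of each email, and outputs each root as the series for its group of emails.
Opening the series page afterwards shows every email in that series.  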
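
The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not the script's actual code: `search_url`, `find_root`, `group_by_root`, and the `parent_of` mapping (message id to the id it replies to) are hypothetical names, and the `?q=` search URL shape is assumed from how the lore.kernel.org search form behaves.

```python
from urllib.parse import quote

LORE_BASE = "https://lore.kernel.org"


def search_url(inbox, query):
    """Build the search URL for one inbox (assumed URL shape)."""
    return f"{LORE_BASE}/{quote(inbox)}/?q={quote(query, safe='')}"


def find_root(parent_of, message_id):
    """Walk up the reply chain until a message with no parent is reached."""
    seen = set()
    while parent_of.get(message_id) is not None and message_id not in seen:
        seen.add(message_id)  # guard against malformed, cyclic threads
        message_id = parent_of[message_id]
    return message_id


def group_by_root(parent_of, hits):
    """Group search hits by thread root; each root stands for one series."""
    series = {}
    for message_id in hits:
        series.setdefault(find_root(parent_of, message_id), []).append(message_id)
    return series
```

With this shape, several hits from the same thread collapse into one series entry keyed by the root message, which matches the "one series per root" output described above.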

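## Key Dependencies  
The script uses selenium and lxml.etree to crawl and parse web pages, and pandas and styleframe to write the data to an Excel file. The selenium webdriver depends on a browser and its matching webdriver tool (Chrome is used by default).
These packages and tools must therefore be installed first, along with a browser whose version matches the webdriver tool.  
The Chrome web driver (chromedriver.exe) is available at:  
[http://npm.taobao.org/mirrors/chromedriver/](http://npm.taobao.org/mirrors/chromedriver/)  
The Firefox web driver (geckodriver.exe) is available at:  
[https://github.com/mozilla/geckodriver/releases](https://github.com/mozilla/geckodriver/releases)  
Drivers for other browsers are listed below:  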
Edge: [https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)  
Safari: [https://webkit.org/blog/6900/webdriver-support-in-safari-10/](https://webkit.org/blog/6900/webdriver-support-in-safari-10/)  
Selenium: [https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/)  
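
Before the first run it can help to confirm that the dependencies are actually present. The check below is a minimal sketch using only the standard library; the helper names are hypothetical, and the import names in `REQUIRED_PACKAGES` (in particular `styleframe`, which older releases may expose under a different module name) are assumptions.

```python
import importlib.util
import shutil

# Assumed import names of the packages listed above.
REQUIRED_PACKAGES = ("selenium", "lxml", "pandas", "styleframe")
# Driver executables that selenium looks for, per supported browser.
DRIVER_FOR_BROWSER = {"chrome": "chromedriver", "firefox": "geckodriver"}


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def driver_path(browser):
    """Return the driver's location on PATH, or None if it is not found."""
    return shutil.which(DRIVER_FOR_BROWSER[browser])


if __name__ == "__main__":
    print("missing packages:", missing_packages())
    print("chromedriver:", driver_path("chrome"))
```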

## Usage  
Run lore-kernel-org.py -h to see the help, for example:  
```
usage: lore-kernel-org.py [-h] [-l [LOG]] [-i INBOX] [-q QUERIES] [-d FROM_DAYS_AGO] [-m MAX_THREADS] [-g] [-c [CATEGORY]]
                          [-o [OUTPUT]] [-b {chrome,firefox}] [-a ATTRIBUTES]

Search and crawl linux patch series from https://lore.kernel.org/. The default config can be defined in default.json.

optional arguments:
  -h, --help            show this help message and exit
  -l [LOG], --log [LOG]
                        specify the log file (default: )
  -i INBOX, --inbox INBOX
                        search linux patch threads from which inbox (default: linux-arm-kernel)
  -q QUERIES, --queries QUERIES
                        search linux patch threads by what queries (default: d:7.days.ago..)
  -d FROM_DAYS_AGO, --from-days-ago FROM_DAYS_AGO
                        search linux patch threads by d:{}.days.ago.. and overwrite --queries (default: None)
  -m MAX_THREADS, --max-threads MAX_THREADS
                        how many threads will be crawled at most (default: 1000)
  -g, --ignore-reply    ignore reply threads, which subject start with "Re:" (default: False)
  -c [CATEGORY], --category [CATEGORY]
                        specify the category JSON file (default: category.json)
  -o [OUTPUT], --output [OUTPUT]
                        save series to which excel file (default: )
  -b {chrome,firefox}, --browser {chrome,firefox}
                        specify the browser (default: chrome)
  -a ATTRIBUTES, --attributes ATTRIBUTES
                        specify the attributes file (default: )
```
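Notes:
* inbox: the inbox (mailing list) name; see [https://lore.kernel.org](https://lore.kernel.org)
* queries: the search query; see [https://lore.kernel.org/all/_/text/help/](https://lore.kernel.org/all/_/text/help/)
* from-days-ago: shorthand for the "d:{}.days.ago.." query; the argument fills in the "{}"
* max-threads: caps the number of emails crawled, to avoid crawling too many patch emails. Set it to -1 for no limit.
* ignore-reply: skips reply emails whose subject starts with 'Re:'. This reduces repeated analysis of the same thread, but some emails may be missed.
* category: classifies the crawled patches. If the option is given without a file name, no classification is done. Described in detail below.
* output: by default the series are saved to inbox-datetime.xlsx. If the option is given without an argument, no Excel file is saved.
* log: by default the log is saved to inbox-datetime.log and holds only the most recent run. If the option is given without an argument, no log file is saved.
* browser: selects Chrome or Firefox; Chrome is the default.
* attributes: names an attributes file, which shortens the command line. Options given on the command line still override the values in the attributes file.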
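
The option semantics above can be illustrated with a short sketch. The function names are hypothetical, and the order in which the reply filter and the limit are applied is an assumption; the file-name pattern follows the log line shown later in this README (`linux-arm-kernel-2022-01-15-23-26-43.log`).

```python
from datetime import datetime


def effective_query(queries, from_days_ago=None):
    """--from-days-ago, when given, overrides --queries."""
    if from_days_ago is not None:
        return "d:{}.days.ago..".format(from_days_ago)
    return queries


def select_threads(subjects, max_threads=1000, ignore_reply=False):
    """Drop 'Re:' subjects if requested, then apply the limit (-1 = no limit)."""
    if ignore_reply:
        subjects = [s for s in subjects if not s.startswith("Re:")]
    else:
        subjects = list(subjects)
    return subjects if max_threads < 0 else subjects[:max_threads]


def default_filename(inbox, when, extension):
    """Build the inbox-datetime file name used for the log and Excel output."""
    return f"{inbox}-{when:%Y-%m-%d-%H-%M-%S}.{extension}"


if __name__ == "__main__":
    print(effective_query("d:7.days.ago..", from_days_ago=3))
    print(default_filename("linux-arm-kernel", datetime.now(), "xlsx"))
```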

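The defaults can be changed in default.json, where each key is the long command-line option name (or the attribute name; that is, a hyphen '-' in a key may also be written as an underscore '_'). For example:  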
```
{
"log": "",
"inbox": "linux-arm-kernel",
"queries": "d:7.days.ago..",
"max-threads": 1000,
"ignore-reply": false,
"category": "category.json",
"output": "",
"browser": "chrome"
}
```

The attributes file is also a JSON file, similar to default.json. For example:
```
{
"log": "",
"inbox": "linux-riscv",
"queries": "d:7.days.ago..",
"max-threads": 1000,
"ignore-reply": false,
"category": "category.json",
"output": "",
"browser": "chrome"
}
```
The full set of keys for default.json or an attributes file can be found in the help printed by -h, or in the first line of the log, for example:  
```
[2022-01-15 23:26:43,924][INFO] Arguments: {'log': 'linux-arm-kernel-2022-01-15-23-26-43.log', 'inbox': 'linux-arm-kernel', 'queries': 'd:7.days.ago..', 'from_days_ago': None, 'max_threads': 1000, 'ignore_reply': False, 'category': 'category.json', 'output': 'linux-arm-kernel-2022-01-15-23-26-43.xlsx', 'browser': 'chrome', 'attributes': 'linux-arm-kernel.json'}
```
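
The precedence described above (default.json, then the attributes file, then the command line) can be sketched as a layered merge. This is an assumption about how such a merge could look, not the script's actual code; `normalize` and `merge_options` are hypothetical names, and treating `None` as "option not given on the command line" is also an assumption.

```python
import json


def normalize(options):
    """Accept both 'max-threads' and 'max_threads' style keys."""
    return {key.replace("-", "_"): value for key, value in options.items()}


def merge_options(defaults, attributes=None, cli=None):
    """Later layers win: defaults < attributes file < command line."""
    merged = normalize(defaults)
    merged.update(normalize(attributes or {}))
    # Only command-line options that were actually given override anything.
    given = {k: v for k, v in normalize(cli or {}).items() if v is not None}
    merged.update(given)
    return merged


if __name__ == "__main__":
    defaults = json.loads('{"inbox": "linux-arm-kernel", "max-threads": 1000}')
    attributes = {"inbox": "linux-riscv"}
    cli = {"max_threads": 50, "inbox": None}
    print(merge_options(defaults, attributes, cli))
```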

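Examples:  
```
# Use the default query parameters, classify the series according to category.json, and write them to an Excel file:
./lore-kernel-org.py
# Or query patches according to a given attributes file and write them to an Excel file:
./lore-kernel-org.py -a linux-arm-kernel.json
```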

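## Patch Classification  
Patches can be classified according to the configuration in the JSON file given by the --category option.  
The file is structured as follows: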
```
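# Classification rules:
# No category may match the exclude pattern of the common section,
# and every category must match the include pattern of the common section.
# A category must not match its own exclude pattern,
# and must match its own include pattern.
# Patterns are case-insensitive.
# An exclude or include field that is empty or missing is ignored.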
{
"common": {
	"exclude": "(\\W)(kvm|dts)(\\W|$)",
	"include": ".*"
},
"category": {
	"irq": {
		"exclude": "",
		"include": "(\\W)(irq|interrupt)s?(\\W|$)"
	},
	"cpu": {
		"exclude": "",
		"include": "cpu"
	}
}
}

```
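The file supports line comments that start with # (leading whitespace is allowed).  
The exclude and include fields each hold a regular expression: exclude means the subject must not match the pattern, include means it must match, and a patch is classified only when both conditions hold.  
Every category must satisfy the patterns of the common section; category lists the specific categories, each of which must also satisfy its own patterns.  
Classification searches the email subject (the Subject line) for these patterns, ignoring case.  
Using the example file:
* If the subject contains kvm or dts, that patch is not used to classify its series (the series can still be classified by its other patches).
* If the subject contains irq or interrupt, the series is assigned to irq.
* If the subject contains cpu, the series is assigned to cpu.
* A series can be assigned to more than one category.
* The classify-series.py script can reclassify data that has already been crawled.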
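
The rules above can be sketched in a few lines. This is a minimal illustration assuming the semantics described in this section, not the code of lore-kernel-org.py or classify-series.py; `load_category`, `satisfies`, and `classify` are hypothetical names, and `EXAMPLE` is a compact copy of the sample file.

```python
import json
import re

EXAMPLE = r'''
# comment lines like this one are dropped before parsing
{
"common": {"exclude": "(\\W)(kvm|dts)(\\W|$)", "include": ".*"},
"category": {
    "irq": {"exclude": "", "include": "(\\W)(irq|interrupt)s?(\\W|$)"},
    "cpu": {"exclude": "", "include": "cpu"}
}
}
'''


def load_category(text):
    """Parse the category JSON after removing '#' comment lines."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("#")]
    return json.loads("\n".join(kept))


def satisfies(rule, subject):
    """True if the subject avoids 'exclude' and matches 'include' (empty = ignored)."""
    exclude = rule.get("exclude") or ""
    include = rule.get("include") or ""
    if exclude and re.search(exclude, subject, re.IGNORECASE):
        return False
    if include and not re.search(include, subject, re.IGNORECASE):
        return False
    return True


def classify(config, subject):
    """Return every category whose rule, plus the common rule, accepts the subject."""
    if not satisfies(config.get("common", {}), subject):
        return []
    categories = config.get("category", {})
    return [name for name, rule in categories.items() if satisfies(rule, subject)]


if __name__ == "__main__":
    config = load_category(EXAMPLE)
    print(classify(config, "[PATCH 1/3] irqchip: fix interrupt routing on cpu hotplug"))
```

Note that the sample patterns require a non-word character before the keyword, so a subject such as "irqchip: ..." is matched here through the word "interrupt", not through "irq".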
  
## Regular Expression HOWTO  
Reference: [https://docs.python.org/zh-cn/3/howto/regex.html](https://docs.python.org/zh-cn/3/howto/regex.html)