Common BeautifulSoup Methods

I. Finding Elements

1. Basic lookup

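.find(name, attrs, recursive, string)
- Returns the first tag that matches, or None if nothing matches.
- Parameters:
  - name: the tag name (a string, a list, a regular expression, or a function).
  - attrs: a dictionary of attributes to match.
  - recursive: whether to search all descendants; defaults to True. With False, only direct children are searched.
  - string: the text inside the tag. Older code spells this argument text; string has been the preferred name since Beautiful Soup 4.4.0.
- Example: soup.find('a', {'class': 'link'})

.find_all(name, attrs, recursive, string)
- Returns every matching tag as a list; the list is empty if nothing matches.
- Takes the same parameters as .find(), plus an optional limit.
- Example: soup.find_all(['p', 'div'])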
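
To make the difference between the two calls concrete, here is a minimal sketch. The sample markup is hypothetical and exists only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div>
  <a class="link" href="/a">First</a>
  <a class="link" href="/b">Second</a>
  <a href="/c">Third</a>
  <p>Paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a", {"class": "link"})     # first match only
all_links = soup.find_all("a", {"class": "link"})  # every match, as a list
mixed = soup.find_all(["p", "div"])                # a list of names matches any of them
missing = soup.find("table")                       # no match, so None

print(first_link["href"])                # /a
print([a["href"] for a in all_links])    # ['/a', '/b']
print([tag.name for tag in mixed])       # ['div', 'p'] (document order)
print(missing)                           # None
```

Because find() can return None, it is usually worth checking the result before reading attributes from it.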

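2. CSS selectors

.select_one(selector)
- Returns the first element that matches the CSS selector, or None if nothing matches.
- Example: soup.select_one('a[href]')
.select(selector)
- Returns all elements that match the CSS selector, as a list.
- Example: soup.select('.container')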
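
A short sketch of CSS-selector lookup, assuming the hypothetical markup below. In Beautiful Soup 4, select_one() returns a single element (or None) and select() always returns a list; there is no select_all() method.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div class="container"><a href="/one">One</a><a>No href</a></div>
<div class="container"><a href="/two">Two</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("a[href]")       # a single Tag, or None
every = soup.select("a[href]")           # always a list, possibly empty
containers = soup.select(".container")   # match by class
nothing = soup.select_one("table")       # no match, so None

print(first["href"])                     # /one
print([a["href"] for a in every])        # ['/one', '/two']
print(len(containers))                   # 2
```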

II. Filtering Elements

1. By tag name

Find a specific tag:

# Find all <p> tags
paragraphs = soup.find_all('p')

2. By attribute

Filter by attribute value:

# Find all <a> tags that have the class "link"
links = soup.find_all('a', {'class': 'link'})

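3. By text content

Filter by the text inside a tag:

# Find all links whose text is exactly "Click here"
# (string= replaces the older text= argument; pass a regular expression for partial matches)
links_with_text = soup.find_all('a', string='Click here')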
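
Text matching has a few subtleties that a small sketch can show. It uses string=, the current name for the older text= argument, on hypothetical markup. A plain string must match the tag's .string exactly, a compiled regex allows partial matches, and a tag with mixed children has no single .string, so neither form matches it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<a href="/1">Click here</a>
<a href="/2">Click here now</a>
<a href="/3"><b>Click</b> here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match against the tag's single string.
exact = soup.find_all("a", string="Click here")

# Partial match via a regular expression.
partial = soup.find_all("a", string=re.compile("Click here"))

# Fallback for tags with nested markup: compare against the combined text.
by_text = [a for a in soup.find_all("a") if "Click here" in a.get_text()]

print([a["href"] for a in exact])    # ['/1']
print([a["href"] for a in partial])  # ['/1', '/2']
print([a["href"] for a in by_text])  # ['/1', '/2', '/3']
```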

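4. Nested structure

Use a CSS selector to express hierarchy:

# Find the <p> tags that are direct children of an element with class "container"
# (use '.container p' to match <p> tags at any depth)
paragraphs_in_container = soup.select('.container > p')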
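
The child combinator (>) and the descendant combinator (a space) select different sets. The sketch below, on hypothetical markup, shows the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div class="container">
  <p>direct child</p>
  <section><p>nested deeper</p></section>
</div>
<p>outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

direct = soup.select(".container > p")   # direct children only
any_depth = soup.select(".container p")  # descendants at any depth

print([p.get_text() for p in direct])     # ['direct child']
print([p.get_text() for p in any_depth])  # ['direct child', 'nested deeper']
```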

III. Extracting Data

1. Text content

Get the text inside a tag:

# Get the text of the first <h1> tag
title = soup.find('h1').text

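2. Attribute values

Get the value of a tag's attribute:

# Get the src attribute of the first <img> tag (raises KeyError if the attribute is missing)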
image_url = soup.find('img')['src']
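
Indexing with square brackets raises KeyError when the attribute is absent, while .get() returns None or a default. A minimal sketch, assuming the hypothetical markup below:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second <img> has no src attribute.
html = '<img src="/logo.png"><img alt="no source">'
soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img")

first_src = images[0]["src"]          # attribute present
second_src = images[1].get("src")     # attribute missing, so None
fallback = images[1].get("src", "")   # attribute missing, so the default
attrs = images[1].attrs               # all attributes as a dict

try:
    images[1]["src"]
    raised = False
except KeyError:
    raised = True

print(first_src, second_src, repr(fallback), attrs, raised)
```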

3. All text

Extract all of the text in the document:
all_text = soup.get_text()
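
get_text() also accepts separator and strip arguments, which are often useful for cleaning up whitespace. A small sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<div><h1> Title </h1><p>First <b>bold</b> line.</p></div>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()                             # strings joined as-is
clean = soup.get_text(separator=" ", strip=True)  # each string stripped, joined by a space

print(repr(raw))    # ' Title First bold line.'
print(repr(clean))  # 'Title First bold line.'
```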

IV. Traversing the Document

1. Iterating over children

Use .contents (a list) or .children (an iterator) to walk an element's direct children:

for child in soup.contents:
    print(child.name)
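
Children include text nodes as well as tags, so it often helps to filter by type. The sketch below, on hypothetical markup, also contrasts .children (direct children) with .descendants (the whole subtree).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; note the space between the two <li> elements.
html = "<ul><li>One</li> <li>Two <b>bold</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# .contents holds tags and text nodes alike.
n_children = len(ul.contents)

# Keep only Tag objects when walking direct children or the whole subtree.
child_tags = [c.name for c in ul.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in ul.descendants if isinstance(d, Tag)]

print(n_children)       # 3 (li, whitespace text node, li)
print(child_tags)       # ['li', 'li']
print(descendant_tags)  # ['li', 'li', 'b']
```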

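2. Using `.parent` and `.parents`

Get the parent and the ancestors of an element:

parent_tag = soup.find('a').parent
# .parents is a generator that yields every ancestor, from the parent up to the document root
ancestors = soup.find('a').parents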
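
Since .parents is a generator, it is typically consumed in a loop or a comprehension. A minimal sketch on hypothetical markup, which also uses find_parent() to jump to the nearest matching ancestor:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<html><body><div id="box"><a href="/x">x</a></div></body></html>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

parent_name = link.parent.name
ancestor_names = [p.name for p in link.parents]  # nearest ancestor first
closest_div = link.find_parent("div")            # nearest <div> ancestor, or None

print(parent_name)        # div
print(ancestor_names)     # ['div', 'body', 'html', '[document]']
print(closest_div["id"])  # box
```

The last entry, '[document]', is the name of the BeautifulSoup object itself.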

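3. Using `.next_sibling` and `.previous_sibling`

Get the next and previous sibling nodes. A sibling is often a whitespace text node rather than a tag; find_next_sibling() and find_previous_sibling() skip over text nodes: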

next_tag = soup.find('a').next_sibling
prev_tag = soup.find('a').previous_sibling
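
The sketch below, on hypothetical markup, shows why .next_sibling can be surprising: the node right after a tag is frequently the newline between two tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with newlines between the tags.
html = """<p>
<a id="first" href="/1">One</a>
<a id="second" href="/2">Two</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="first")

raw_next = first.next_sibling                    # the newline text node
next_tag = first.find_next_sibling("a")          # the next <a> tag
prev_tag = next_tag.find_previous_sibling("a")   # back to the first <a>
after_last = next_tag.find_next_sibling("a")     # no further <a>, so None

print(repr(raw_next))   # '\n'
print(next_tag["id"])   # second
print(prev_tag["id"])   # first
print(after_last)       # None
```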

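V. Modifying the Document

1. Replacing content

Assign to .string to replace the text inside a tag (any existing children of the tag are replaced):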
soup.find('h1').string = 'New Title'

2. Adding a new tag

Use new_tag() to create a tag, then insert it:

new_p = soup.new_tag('p', id='new-para')
soup.body.append(new_p)
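
Putting the two edits together, a minimal end-to-end sketch on a hypothetical document might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical starting document, used only for illustration.
soup = BeautifulSoup("<html><body><h1>Old</h1></body></html>", "html.parser")

# Replace the heading text.
soup.find("h1").string = "New Title"

# Create a new <p>, give it text, and append it to <body>.
new_p = soup.new_tag("p", id="new-para")
new_p.string = "Added paragraph"
soup.body.append(new_p)

result = str(soup)
print(result)
# <html><body><h1>New Title</h1><p id="new-para">Added paragraph</p></body></html>
```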

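3. Removing a tag

Use .decompose() to remove an element together with its contents:

unwanted_div = soup.find('div', {'class': 'ad'})
if unwanted_div is not None:  # find() returns None when nothing matches
    unwanted_div.decompose()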
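
To remove every matching element rather than just the first, one option is to loop over find_all(). A small sketch, assuming hypothetical markup with two ad blocks:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<div class="ad">Buy now</div><p>Content</p><div class="ad">More ads</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list up front, so removing elements inside the loop is safe.
for ad in soup.find_all("div", {"class": "ad"}):
    ad.decompose()

remaining = str(soup)
print(remaining)  # <p>Content</p>
```

If the removed element is still needed afterwards, .extract() removes it from the tree and returns it instead of destroying it.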

VI. Advanced Features

1. Regular expression matching

Use re.compile to match tag names, attribute values, or text:

import re
pattern = re.compile(r'^[a-zA-Z0-9]+$')
tags = soup.find_all('input', {'id': pattern})
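
A runnable sketch of both uses, on hypothetical markup: a regex as an attribute filter and a regex as the tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<input id="user1">
<input id="user-name">
<input>
<h1>Title</h1><h2>Subtitle</h2><p>Body</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute filter: ids made up of letters and digits only.
alnum_id = re.compile(r"^[a-zA-Z0-9]+$")
inputs = soup.find_all("input", {"id": alnum_id})

# Tag-name filter: any heading from h1 to h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))

print([tag["id"] for tag in inputs])   # ['user1']
print([tag.name for tag in headings])  # ['h1', 'h2']
```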

2. Filtering with a custom function

Pass a function that takes a tag and returns True or False:

def has_href(tag):
    return tag.has_attr('href')

links = soup.find_all(has_href)
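
Note that a function filter is applied to every tag, so has_href matches any element with an href, not only <a>. The sketch below uses hypothetical markup; is_external_link is a made-up helper that narrows the match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = (
    '<a href="/x" class="ext">x</a>'
    '<a name="anchor">y</a>'
    '<link href="/s.css">'
    '<p class="ext">z</p>'
)
soup = BeautifulSoup(html, "html.parser")

def has_href(tag):
    return tag.has_attr("href")

def is_external_link(tag):
    # Hypothetical rule: an <a> with an href and the class "ext".
    return tag.name == "a" and tag.has_attr("href") and "ext" in tag.get("class", [])

with_href = soup.find_all(has_href)
external = soup.find_all(is_external_link)

print([tag.name for tag in with_href])    # ['a', 'link']
print([tag["href"] for tag in external])  # ['/x']
```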

VII. Working with HTML Documents

1. Parsing HTML

Choose a parser:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')  # or 'lxml' or 'html5lib', which must be installed separately
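
Because lxml and html5lib are optional installs, one possible pattern is to try a preferred parser and fall back to the built-in one. The make_soup helper below is hypothetical, not part of Beautiful Soup; it relies on the FeatureNotFound exception that bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: use the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")

# Well-formed sample markup, so both parsers find the same paragraphs.
soup = make_soup("<p>Hello</p><p>World</p>")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)  # ['Hello', 'World']
```

Parsers can build different trees from malformed markup, so a fallback like this is safest when the input is reasonably well formed.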

2. Pretty-printing

Format the HTML with indentation:

pretty_html = soup.prettify()
print(pretty_html)

VIII. Common Examples

Extracting all links

for link in soup.find_all('a'):
    print(link.get('href'))

Reading table data

table = soup.find('table')
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all(['th', 'td'])
    cols = [col.text.strip() for col in cols]
    print(cols)
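
When the first row holds the column headers, the rows can be turned into dictionaries. This is a sketch under that assumption, using a hypothetical table:

```python
from bs4 import BeautifulSoup

# Hypothetical sample table whose first row is the header.
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Ada</td><td> 36 </td></tr>
  <tr><td>Linus</td><td>54</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = table.find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Pair each data cell with its column header.
records = [
    dict(zip(header, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
# [{'Name': 'Ada', 'Age': '36'}, {'Name': 'Linus', 'Age': '54'}]
```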

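Working with nested structure

nav = soup.select_one('.nav')  # None if the page has no .nav element
links_in_nav = nav.find_all('a') if nav is not None else []
for link in links_in_nav:
    print(link.text, link.get('href'))  # .get() returns None instead of raising when href is missing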
