Pandas Read HTML Tables


The pd.read_html() function in pandas parses the <table> elements in an HTML document and converts each one into a DataFrame. It is useful whenever the data you need is published as a table on a web page, for example stock quotes or financial statements.

Basic Usage

read_html() finds every table in the page and returns a list of DataFrames, one per table, even when only one table is found. It raises a ValueError if no table is found.

Reading a Single Table

import pandas as pd

# read_html() needs an HTML parser: lxml (tried first), or
# beautifulsoup4 + html5lib as a fallback
# pip install lxml beautifulsoup4 html5lib

# Read every table on the page (placeholder URL)
tables = pd.read_html("https://example.com/table.html")

# Check the number of tables found
print(f"Found {len(tables)} tables")

# read_html() returns a list, so take the first table by index
df = tables[0]
print(df.head())

Reading Multiple Tables

A web page may contain several tables. Because read_html() returns a list, you can select a table by index or loop over all of them:

import pandas as pd

# Assume the page (placeholder URL) contains several tables
tables = pd.read_html("https://example.com/data.html")

# Print the shape and first rows of each table
for i, table in enumerate(tables):
    print(f"\n=== Table {i + 1} ===")
    print(f"Shape: {table.shape}")
    print(table.head(3))
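Selecting a table by position can break when the page layout changes. As a rough sketch of an alternative, the helper below returns the first table whose columns include a set of required names. The name find_table and the sample DataFrames are illustrative assumptions, not part of pandas; in real use the list would come from pd.read_html().

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names.

    find_table is a hypothetical helper, not a pandas function.
    """
    required = set(required_columns)
    for table in tables:
        # Compare against the column labels as strings
        if required.issubset(str(col) for col in table.columns):
            return table
    raise LookupError(f"No table has columns {sorted(required)}")


# Stand-ins for the list returned by pd.read_html()
tables = [
    pd.DataFrame({"Menu": ["Home", "About"]}),
    pd.DataFrame({"Stock Code": ["600519"], "Close Price": [1850.0]}),
]

prices = find_table(tables, ["Stock Code", "Close Price"])
print(prices)
```

Raising LookupError instead of returning None makes a missing table fail loudly rather than surfacing later as an attribute error.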

Detailed Explanation of Common Parameters

Parameter	Description	Example
io	Input source: URL, file path, or file-like object	"page.html"
match	String or regex; only tables containing matching text are returned	match="sales"
header	Row index (or list of indices) to use as the column headers	header=0
attrs	Dictionary of HTML attributes that the <table> element must have	attrs={"id": "table1"}
skiprows	Rows to skip: an integer count or a list of row indices	skiprows=[0, 1]
na_values	Additional strings to treat as missing values	na_values=["N/A"]
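The header, skiprows, and na_values parameters are easiest to understand on a small table. The following sketch uses an invented HTML fragment (the table contents are assumptions for illustration) in which the first row is a title and the column names sit in ordinary <td> cells, so pandas cannot infer the header on its own.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment: a title row, then a header row built from
# <td> cells (so pandas cannot infer it), then the data rows
html = """
<table>
  <tr><td>Quarterly report</td><td></td><td></td></tr>
  <tr><td>Region</td><td>Units</td><td>Revenue</td></tr>
  <tr><td>North</td><td>120</td><td>3400</td></tr>
  <tr><td>South</td><td>--</td><td>2900</td></tr>
</table>
"""

df = pd.read_html(
    StringIO(html),
    skiprows=[0],      # drop the title row
    header=0,          # the first remaining row holds the column names
    na_values=["--"],  # treat the "--" placeholder as missing
)[0]

print(df)
print(df.dtypes)
```

Because "--" is read as a missing value, the Units column can be parsed as numeric instead of staying as text.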

Using the match Parameter to Filter Tables

import pandas as pd

# match keeps only the tables whose text matches a string or regex
# Example: read only tables containing the word "stock" (case-sensitive)
try:
    tables = pd.read_html(
        "https://example.com/finance.html",  # placeholder URL
        match="stock",
    )
except ValueError:
    # read_html() raises ValueError when no table matches
    tables = []

if tables:
    df = tables[0]
    print(df)
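To try match without network access, here is a minimal sketch that runs against an inline HTML string. The two tables and their contents are made up for the demonstration.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second mentions "Close Price"
html = """
<table>
  <tr><th>Section</th><th>Link</th></tr>
  <tr><td>News</td><td>/news</td></tr>
</table>
<table>
  <tr><th>Stock Name</th><th>Close Price</th></tr>
  <tr><td>Wuliangye</td><td>185.50</td></tr>
</table>
"""

# Without match, both tables come back
all_tables = pd.read_html(StringIO(html))
print(len(all_tables))

# With match, only tables whose text matches the pattern are returned
price_tables = pd.read_html(StringIO(html), match="Close Price")
print(price_tables[0])

# A pattern that matches nothing raises ValueError instead of returning []
no_match_error = None
try:
    pd.read_html(StringIO(html), match="Dividend")
except ValueError as exc:
    no_match_error = exc
    print(f"No match: {exc}")
```

Each call gets a fresh StringIO object so that every read starts at the beginning of the text.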

Using the attrs Parameter to Precisely Locate Tables

import pandas as pd

# Locate a table by its HTML attributes (placeholder URL)
# Example: the <table> element with id="stock-table"
tables_by_id = pd.read_html(
    "https://example.com/data.html",
    attrs={"id": "stock-table"},
)

# Or locate tables by class
tables_by_class = pd.read_html(
    "https://example.com/data.html",
    attrs={"class": "data-table"},
)

df = tables_by_class[0]
print(df)
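The same idea can be checked offline. This sketch assumes a page fragment with a navigation table and a data table; the id and class values are invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical fragment: a navigation table and the data table we want
html = """
<table id="nav-table">
  <tr><th>Section</th></tr>
  <tr><td>News</td></tr>
</table>
<table id="stock-table" class="data-table">
  <tr><th>Stock Name</th><th>Close Price</th></tr>
  <tr><td>Wuliangye</td><td>185.50</td></tr>
</table>
"""

# Select by id: only the table with id="stock-table" is returned
by_id = pd.read_html(StringIO(html), attrs={"id": "stock-table"})

# Select by class: here the same table carries class="data-table"
by_class = pd.read_html(StringIO(html), attrs={"class": "data-table"})

print(len(by_id), len(by_class))
print(by_id[0])
```

An id is normally unique within a page, so it tends to be the more reliable selector when the page provides one.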

Practical Example: Scraping Stock Data from a Web Page

The following example parses a stock list like one a financial website might publish. It uses an inline HTML string so that it runs without network access:

import pandas as pd
from io import StringIO

# Simulated HTML content for demonstration
# Replace with a real URL in actual usage
html_content = '''
<table>
    <tr>
        <th>Stock Code</th>
        <th>Stock Name</th>
        <th>Close Price</th>
        <th>Change %</th>
    </tr>
    <tr>
        <td>600519</td>
        <td>Kweichow Moutai</td>
        <td>1850.00</td>
        <td>2.35%</td>
    </tr>
    <tr>
        <td>000858</td>
        <td>Wuliangye</td>
        <td>185.50</td>
        <td>-1.20%</td>
    </tr>
    <tr>
        <td>601318</td>
        <td>China Ping An</td>
        <td>52.30</td>
        <td>0.85%</td>
    </tr>
</table>
'''

# Read the table from the string; keep stock codes as text so that
# leading zeros (000858) are not lost
tables = pd.read_html(StringIO(html_content), converters={"Stock Code": str})

df = tables[0]

print("Original Table:")
print(df)
print()

# Data cleaning: strip the percent sign and convert to float
df["Change %"] = df["Change %"].str.replace("%", "").astype(float)
df["Close Price"] = df["Close Price"].astype(float)

print("Cleaned Data:")
print(df)

StringIO wraps the string in a file-like object, so pandas reads it as a stream instead of interpreting it as a path or URL. pandas 2.1 and later deprecate passing a literal HTML string to read_html() directly, so wrapping it in StringIO is the recommended form.

Output:

Original Table:
  Stock Code       Stock Name  Close Price Change %
0     600519  Kweichow Moutai       1850.0    2.35%
1     000858        Wuliangye        185.5   -1.20%
2     601318    China Ping An         52.3    0.85%

Cleaned Data:
  Stock Code       Stock Name  Close Price  Change %
0     600519  Kweichow Moutai       1850.0      2.35
1     000858        Wuliangye        185.5     -1.20
2     601318    China Ping An         52.3      0.85

Reading Wikipedia Tables

import pandas as pd

# Read tables from a Wikipedia page
# Example: list of countries by population
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

try:
    tables = pd.read_html(url)
    print(f"This page contains {len(tables)} tables")

    # The table you need is not always the first one; inspect the list
    # (or use match/attrs) and adjust the index accordingly
    df = tables[0]
    print(df.head(10))
except Exception as e:
    print(f"Failed to read: {e}")
    print("Make sure the parser libraries are installed: pip install lxml beautifulsoup4 html5lib")

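Data Cleaning and Processing

Tables read from web pages usually need cleaning before analysis: numbers arrive as text with thousands separators, missing values appear as placeholder strings, and dates are plain strings.

Common Cleaning Steps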

import pandas as pd
from io import StringIO

# Simulate a table with messy data
html_content = '''
<table>
    <tr><th>Date</th><th>Product</th><th>Sales</th><th>Remarks</th></tr>
    <tr><td>2024-01-01</td><td>Product A</td><td>1,234</td><td>N/A</td></tr>
    <tr><td>2024-01-02</td><td>Product B</td><td>567</td><td>Out of stock</td></tr>
    <tr><td>2024-01-03</td><td>Product C</td><td>--</td><td>Not counted</td></tr>
</table>
'''

# Keep the raw strings so the cleaning steps are visible: by default
# read_html() strips "," as a thousands separator and reads "N/A" as missing
df = pd.read_html(StringIO(html_content), thousands=None, keep_default_na=False)[0]
print("Original Data:")
print(df)
print()

# Cleaning steps
# 1. Rename columns
df.columns = ["Date", "Product", "Sales", "Remarks"]

# 2. Replace placeholder strings with missing values
df = df.replace(["N/A", "--", "Not counted", "Out of stock"], pd.NA)

# 3. Clean the numeric column (remove thousands separators)
df["Sales"] = df["Sales"].str.replace(",", "")
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

# 4. Convert dates
df["Date"] = pd.to_datetime(df["Date"])

print("Cleaned Data:")
print(df)
print("\nData Types:")
print(df.dtypes)

Output (as printed by pandas 2.x; spacing, dtype names, and the missing-value marker vary by version):

Original Data:
         Date    Product  Sales       Remarks
0  2024-01-01  Product A  1,234           N/A
1  2024-01-02  Product B    567  Out of stock
2  2024-01-03  Product C     --   Not counted

Cleaned Data:
        Date    Product   Sales Remarks
0 2024-01-01  Product A  1234.0    <NA>
1 2024-01-02  Product B   567.0    <NA>
2 2024-01-03  Product C     NaN    <NA>

Data Types:
Date       datetime64[ns]
Product            object
Sales             float64
Remarks            object
dtype: object

Notes and Common Issues

Install Dependencies: read_html() needs an HTML parser. It tries lxml first and falls back to beautifulsoup4 with html5lib. Install them with pip install lxml beautifulsoup4 html5lib.
Network Issues: Reading from a URL can be slow, and some sites block or throttle automated requests. Consider downloading the page yourself with a timeout and caching a local copy.
Complex Table Structures: Tables with merged cells or nested tables may be parsed into an unexpected layout or not at all. In such cases, try parsing the HTML directly with BeautifulSoup.
Data Integrity: Web data may change over time. After reading, verify the result (row counts, column names, spot checks) against the original source.
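The timeout and local caching advice can be sketched with the standard library alone. The function names download and fetch_html_cached, the cache file, and the one-hour max_age are assumptions chosen for illustration. The demonstration at the end uses a stand-in fetcher so that it runs without network access.

```python
import tempfile
import time
import urllib.request
from pathlib import Path


def download(url, timeout):
    """Fetch a page over HTTP and return its text (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_html_cached(url, cache_file, max_age=3600, timeout=10, fetch=download):
    """Return the HTML for url, reusing cache_file if it is fresh enough.

    Hypothetical helper: max_age is the cache lifetime in seconds.
    """
    cache_path = Path(cache_file)
    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age < max_age:
            return cache_path.read_text(encoding="utf-8")
    html = fetch(url, timeout)
    cache_path.write_text(html, encoding="utf-8")
    return html


# Offline demonstration with a stand-in fetcher (no network access needed)
calls = []


def fake_fetch(url, timeout):
    calls.append(url)
    return "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "page.html"
    first = fetch_html_cached("https://example.com/data.html", cache, fetch=fake_fetch)
    second = fetch_html_cached("https://example.com/data.html", cache, fetch=fake_fetch)

print(len(calls))  # counts how many times the fetcher was actually invoked
```

The returned string can then be parsed with pd.read_html(StringIO(html)), which keeps the download step and the parsing step separate and avoids repeated requests while you experiment.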

When scraping web data, comply with the website’s robots.txt rules and with applicable laws and regulations, and space out your requests so that you do not overload the server.
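One way to check robots.txt rules before fetching a page is the standard library's urllib.robotparser module. The sketch below parses an invented robots.txt from a string; the rules, URLs, and the user agent name are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download it with
# parser.set_url("https://example.com/robots.txt") followed by parser.read()
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "my-pandas-scraper"  # assumed name for this script

allowed_public = parser.can_fetch(user_agent, "https://example.com/data.html")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/data.html")
delay = parser.crawl_delay(user_agent)

print(allowed_public, allowed_private, delay)
```

If can_fetch() returns False for a URL, skip it. If crawl_delay() returns a number, wait at least that many seconds between requests.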

Summary

pd.read_html() converts the HTML tables on a web page into DataFrames with a single call. In practice, make sure a parser library is installed, use match or attrs to select the table you need, and handle network failures. Scraped tables usually need cleaning before analysis, which pandas’ own data cleaning functions handle well.
