Pandas Read HTML Tables


The pd.read_html() function in pandas automatically parses HTML tables on web pages and converts them into DataFrames. This is extremely useful for web scraping, stock price analysis, financial data acquisition, and similar scenarios.

Basic Usage

The read_html() function finds and reads all HTML table elements on a web page and returns a list of DataFrames.

Reading a Single Table

    import pandas as pd

    # Read all tables on the webpage
    # Note: read_html() needs an HTML parser: lxml (the default), or
    # beautifulsoup4 together with html5lib as a fallback
    # pip install lxml beautifulsoup4 html5lib

    # Read from URL (placeholder address)
    tables = pd.read_html("https://example.com/table.html")

    # Check the number of tables found
    print(f"Found {len(tables)} tables")

    # Get the first table
    df = tables[0]
    print(df.head())

Reading Multiple Tables

A web page may contain multiple tables. read_html() returns a list, and you can select tables by index:

    import pandas as pd

    # Assume the webpage has multiple tables
    tables = pd.read_html("https://example.com/data.html")

    # Iterate through all tables
    for i, table in enumerate(tables):
        print(f"\n=== Table {i+1} ===")
        print(f"Shape: {table.shape}")
        print(table.head(3))

Detailed Explanation of Common Parameters

    Parameter   Description                                                  Example
    io          Input source: URL, file path, or file-like object (StringIO) "page.html"
    match       String or regex; keep only tables whose text matches it      match="sales"
    header      Row index (or list of indices) to use as column headers      header=0
    attrs       Dict of HTML attributes the table element must have          attrs={"id": "table1"}
    skiprows    Rows to skip: a count, or a list of row indices              skiprows=[0, 1]
    na_values   Additional strings to treat as missing values                na_values=["N/A"]
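
To see how several of these parameters behave, here is a minimal sketch run against an inline HTML string. The page content, the table ids, and the "unknown" placeholder are all made up for illustration; only the read_html() parameters themselves come from the table above.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content: two tables, distinguished by their id attributes.
html = """
<table id="summary">
  <tr><th>Metric</th><th>Value</th></tr>
  <tr><td>Visitors</td><td>120</td></tr>
</table>
<table id="sales-table">
  <tr><th>Region</th><th>Sales</th></tr>
  <tr><td>North</td><td>100</td></tr>
  <tr><td>South</td><td>unknown</td></tr>
</table>
"""

# match: keep only tables whose text matches the string or regex.
by_text = pd.read_html(StringIO(html), match="Sales")

# attrs: keep only table elements carrying these HTML attributes.
by_id = pd.read_html(StringIO(html), attrs={"id": "summary"})

# header and na_values: use row 0 as the column names and treat the
# placeholder "unknown" as a missing value.
sales = pd.read_html(
    StringIO(html),
    attrs={"id": "sales-table"},
    header=0,
    na_values=["unknown"],
)[0]

print(len(by_text), len(by_id))
print(sales)
```

Each call gets its own StringIO object because a file-like object is consumed once it has been read.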

Using the match Parameter to Filter Tables

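    import pandas as pd

    # Use the match parameter to return only tables containing specific text
    # Example: Read only tables containing the word "stock"
    # read_html() raises ValueError when no table matches, so catch it
    # instead of testing for an empty list
    try:
        tables = pd.read_html(
            "https://example.com/finance.html",
            match="stock"
        )
    except ValueError as e:
        print(f"No matching table: {e}")
    else:
        df = tables[0]
        print(df)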

Using the attrs Parameter to Precisely Locate Tables

    import pandas as pd

    # Locate tables by HTML attributes
    # Example: Get the table with id="stock-table"
    tables = pd.read_html(
        "https://example.com/data.html",
        attrs={"id": "stock-table"}  # Find table element with id="stock-table"
    )

    # Or locate by class
    tables = pd.read_html(
        "https://example.com/data.html",
        attrs={"class": "data-table"}
    )

    # read_html() always returns a list, so take the first match
    df = tables[0]
    print(df)

Practical Example: Scraping Stock Data from a Web Page

The following example demonstrates how to scrape stock list data from a financial website:

    import pandas as pd
    from io import StringIO

    # Create simulated HTML content for demonstration
    # Replace with a real URL in actual usage
    html_content = '''
    <table>
        <tr>
            <th>Stock Code</th>
            <th>Stock Name</th>
            <th>Close Price</th>
            <th>Change %</th>
        </tr>
        <tr>
            <td>600519</td>
            <td>Kweichow Moutai</td>
            <td>1850.00</td>
            <td>2.35%</td>
        </tr>
        <tr>
            <td>000858</td>
            <td>Wuliangye</td>
            <td>185.50</td>
            <td>-1.20%</td>
        </tr>
        <tr>
            <td>601318</td>
            <td>China Ping An</td>
            <td>52.30</td>
            <td>0.85%</td>
        </tr>
    </table>
    '''

    # Read table from string
    # Keep stock codes as text so leading zeros (000858) are not lost
    tables = pd.read_html(StringIO(html_content), converters={"Stock Code": str})

    df = tables[0]

    print("Original Table:")
    print(df)
    print()

    # Data cleaning: Process the "Change %" column
    df["Change %"] = df["Change %"].str.replace("%", "").astype(float)
    df["Close Price"] = df["Close Price"].astype(float)

    print("Cleaned Data:")
    print(df)

StringIO wraps the string in a file-like object, so pandas parses its contents as HTML instead of interpreting the string as a path or URL. Passing a literal HTML string directly has been deprecated since pandas 2.1.

Output:

    Original Table:
       Stock Code       Stock Name  Close Price  Change %
    0      600519  Kweichow Moutai       1850.0     2.35%
    1      000858        Wuliangye        185.5    -1.20%
    2      601318    China Ping An         52.3     0.85%

    Cleaned Data:
       Stock Code       Stock Name  Close Price  Change %
    0      600519  Kweichow Moutai       1850.0      2.35
    1      000858        Wuliangye        185.5     -1.20
    2      601318    China Ping An         52.3      0.85

Reading Wikipedia Tables

    import pandas as pd

    # Read tables from a Wikipedia page
    # Example: List of countries by population
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

    try:
        tables = pd.read_html(url)
        print(f"This page contains {len(tables)} tables")

        # Usually, the first table is what we need
        df = tables[0]
        print(df.head(10))
    except Exception as e:
        print(f"Failed to read: {e}")
        print("Make sure required libraries are installed: pip install lxml beautifulsoup4 html5lib")

Data Cleaning and Processing

Tables read from web pages often require further cleaning before analysis.

Common Cleaning Steps

    import pandas as pd
    from io import StringIO

    # Simulate a table with messy data
    html_content = '''
    <table>
        <tr><th>Date</th><th>Product</th><th>Sales</th><th>Remarks</th></tr>
        <tr><td>2024-01-01</td><td>Product A</td><td>1,234</td><td>N/A</td></tr>
        <tr><td>2024-01-02</td><td>Product B</td><td>567</td><td>Out of stock</td></tr>
        <tr><td>2024-01-03</td><td>Product C</td><td>--</td><td>Not counted</td></tr>
    </table>
    '''

    # Keep the raw cell text for this demo: by default read_html() already
    # strips thousands separators and turns "N/A" into NaN
    df = pd.read_html(StringIO(html_content), thousands=None, keep_default_na=False)[0]
    print("Original Data:")
    print(df)
    print()

    # Cleaning steps
    # 1. Rename columns
    df.columns = ["Date", "Product", "Sales", "Remarks"]

    # 2. Handle missing values
    df = df.replace(["N/A", "--", "Not counted", "Out of stock"], pd.NA)

    # 3. Clean numeric column (remove commas, then convert to numbers)
    df["Sales"] = df["Sales"].str.replace(",", "")
    df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

    # 4. Convert dates
    df["Date"] = pd.to_datetime(df["Date"])

    print("Cleaned Data:")
    print(df)
    print("\nData Types:")
    print(df.dtypes)

Output (dtype names and column spacing can vary slightly between pandas versions):

    Original Data:
             Date    Product  Sales       Remarks
    0  2024-01-01  Product A  1,234           N/A
    1  2024-01-02  Product B    567  Out of stock
    2  2024-01-03  Product C     --   Not counted

    Cleaned Data:
            Date    Product   Sales  Remarks
    0 2024-01-01  Product A  1234.0     <NA>
    1 2024-01-02  Product B   567.0     <NA>
    2 2024-01-03  Product C     NaN     <NA>

    Data Types:
    Date       datetime64[ns]
    Product            object
    Sales             float64
    Remarks            object
    dtype: object
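
Part of this cleaning can also be done while reading. The sketch below assumes a similar, made-up sales table and relies on two read_html() parameters: thousands (which defaults to ",") and na_values from the parameter table. It is one possible shortcut, not a replacement for checking the result.

```python
from io import StringIO

import pandas as pd

# Hypothetical sales table with a thousands separator and a "--" placeholder.
html = """
<table>
  <tr><th>Date</th><th>Product</th><th>Sales</th></tr>
  <tr><td>2024-01-01</td><td>Product A</td><td>1,234</td></tr>
  <tr><td>2024-01-02</td><td>Product B</td><td>567</td></tr>
  <tr><td>2024-01-03</td><td>Product C</td><td>--</td></tr>
</table>
"""

# thousands="," removes the separator in numeric-looking cells;
# na_values=["--"] marks the placeholder as missing.
df = pd.read_html(StringIO(html), thousands=",", na_values=["--"])[0]

# Dates still arrive as text, so convert them explicitly.
df["Date"] = pd.to_datetime(df["Date"])

print(df)
print(df.dtypes)
```

With the placeholder declared up front, the Sales column should arrive as a numeric column, with the "--" cell as NaN.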

Notes and Common Issues

  1. Install Dependencies
     read_html() needs an HTML parser: lxml (used by default), or beautifulsoup4 together with html5lib as a fallback:
     pip install lxml beautifulsoup4 html5lib
  2. Network Issues
     Reading from a URL can be slow or blocked by the site. read_html() has no timeout option of its own, so consider downloading the page yourself with a timeout, or working from a locally cached copy.
  3. Complex Table Structures
     Some web tables use complex structures such as merged cells or nested tables, which may be parsed incorrectly or not at all. In such cases, try parsing the HTML directly with BeautifulSoup.
  4. Data Integrity
     Web data may change over time. Always verify data integrity after reading and compare it with the original source.
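
For the network point above, one possible approach is to download the page with the standard library, which accepts a timeout, keep a local copy, and hand the text to read_html() through StringIO. The helper names fetch_html and read_tables are hypothetical, and both the UTF-8 decoding and the browser-like User-Agent header are assumptions that may not suit every site.

```python
from io import StringIO
from pathlib import Path
from urllib.request import Request, urlopen

import pandas as pd


def fetch_html(url: str, cache_file: Path, timeout: float = 10.0) -> str:
    """Return the page's HTML, reusing the local copy when one exists."""
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the page is UTF-8 encoded.
        html = response.read().decode("utf-8", errors="replace")
    cache_file.write_text(html, encoding="utf-8")
    return html


def read_tables(url: str, cache_file: Path, timeout: float = 10.0) -> list:
    """Fetch (or load from cache) a page and parse every table in it."""
    html = fetch_html(url, cache_file, timeout)
    return pd.read_html(StringIO(html))
```

A call such as read_tables("https://example.com/data.html", Path("data_cache.html")) downloads the page once; later calls read the cached file, so delete it when fresh data is needed.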

When scraping web data, please comply with the website’s robots.txt rules and applicable laws and regulations. Avoid frequent requests to prevent server overload.
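
As a rough sketch of that advice, the standard library's urllib.robotparser can check a URL against robots.txt rules, and a short pause between requests keeps the load low. The allowed_urls helper, the sample rules, and the one-second default delay are illustrative assumptions, not recommendations from any particular site.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_txt, user_agent="*", delay_seconds=1.0):
    """Yield only the URLs that robots.txt permits, pausing between them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, so skip it
        if not first:
            time.sleep(delay_seconds)  # space out consecutive requests
        first = False
        yield url


# Hypothetical rules and URLs for demonstration.
robots_txt = "User-agent: *\nDisallow: /private/\n"
urls = [
    "https://example.com/data.html",
    "https://example.com/private/table.html",
]
for url in allowed_urls(urls, robots_txt, delay_seconds=0):
    print("OK to fetch:", url)
```

For a live site, RobotFileParser can download the rules itself with set_url() followed by read(); each URL that passes the check can then be given to pd.read_html().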

Summary

pd.read_html() is a powerful tool for scraping tabular data from web pages, quickly converting HTML tables into DataFrames. In practice, ensure dependencies are installed, handle network issues, and clean messy data. Scraped data typically requires further processing before analysis, so combine it with pandas’ data cleaning capabilities.
