Pandas Read HTML Tables
The pd.read_html() function in pandas automatically parses HTML tables on web pages and converts them into DataFrames. This is extremely useful for web scraping, stock price analysis, financial data acquisition, and similar scenarios.
Basic Usage
The read_html() function finds and reads all HTML table elements on a web page and returns a list of DataFrames, one per table. If no table is found, it raises a ValueError.
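The list return value can be checked without any network access. The following is a minimal sketch, assuming a small made-up HTML snippet (the table contents and the `html` variable are invented for illustration) wrapped in StringIO:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables, invented for this illustration
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>1.5</td></tr>
  <tr><td>Book</td><td>12.0</td></tr>
</table>
"""

# One DataFrame is returned per <table> element, in document order
tables = pd.read_html(StringIO(html))

print(type(tables).__name__)
print(len(tables))
for i, table in enumerate(tables):
    print(i, list(table.columns), table.shape)
```

Because the result is always a list, even a page with a single table has to be indexed (for example `tables[0]`) before DataFrame methods can be used.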
Reading a Single Table

    import pandas as pd

    # Read all tables on the web page.
    # Note: an HTML parser is required. lxml is the default;
    # beautifulsoup4 + html5lib is the fallback.
    # pip install lxml beautifulsoup4 html5lib

    # Read from URL
    tables = pd.read_html("https://example.com/table.html")

    # Check the number of tables found
    print(f"Found {len(tables)} tables")

    # Get the first table (read_html returns a list, so index into it)
    df = tables[0]
    print(df.head())

Reading Multiple Tables
A web page may contain multiple tables. read_html() returns a list, so you can select a table by index or loop over all of them:

    import pandas as pd

    # Assume the web page has multiple tables
    tables = pd.read_html("https://example.com/data.html")

    # Iterate through all tables
    for i, table in enumerate(tables):
        print(f"\n=== Table {i+1} ===")
        print(f"Shape: {table.shape}")
        print(table.head(3))

Detailed Explanation of Common Parameters

| Parameter | Description | Example |
|---|---|---|
| io | Input source: URL, file path, or file-like object (wrap a literal HTML string in StringIO) | "page.html" |
| match | String or regex; only tables containing matching text are returned | match="sales" |
| header | Row index to use as the column headers | header=0 |
| attrs | Filter tables by HTML attributes | attrs={"id": "table1"} |
| skiprows | Rows to skip | skiprows=[0, 1] |
| na_values | Strings to treat as missing values | na_values=["N/A"] |
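
Several of these parameters can be tried offline. The sketch below assumes a made-up page with two tables (the ids `prices` and `sales`, the table contents, and the placeholder string "missing" are all invented for illustration) and shows match, attrs, and na_values each narrowing or changing the result:

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; ids and contents are invented
html = """
<table id="prices">
  <tr><th>Stock</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.5</td></tr>
  <tr><td>BBB</td><td>missing</td></tr>
</table>
<table id="sales">
  <tr><th>Region</th><th>Sales</th></tr>
  <tr><td>North</td><td>120</td></tr>
</table>
"""

# match: keep only tables whose text matches the pattern
by_text = pd.read_html(StringIO(html), match="Region")

# attrs: keep only tables whose HTML attributes match
by_id = pd.read_html(StringIO(html), attrs={"id": "prices"})

# na_values: treat the string "missing" as a missing value,
# which lets the Price column be parsed as numeric
with_na = pd.read_html(
    StringIO(html),
    attrs={"id": "prices"},
    na_values=["missing"],
)[0]

print(len(by_text), list(by_text[0].columns))
print(len(by_id), list(by_id[0].columns))
print(with_na["Price"].isna().sum())
```

Filtering with match or attrs still returns a list, so the selected table is taken with `[0]` as usual.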
Using the match Parameter to Filter Tables

    import pandas as pd

    # Use the match parameter to return only tables containing specific text
    # Example: read only tables containing the word "stock"
    try:
        tables = pd.read_html(
            "https://example.com/finance.html",
            match="stock"
        )
    except ValueError:
        # read_html raises ValueError when no table matches
        tables = []

    if tables:
        df = tables[0]
        print(df)

Using the attrs Parameter to Precisely Locate Tables

    import pandas as pd

    # Locate tables by HTML attributes
    # Example: get the table with id="stock-table"
    tables = pd.read_html(
        "https://example.com/data.html",
        attrs={"id": "stock-table"}  # Find the table element with id="stock-table"
    )

    # Or locate by class
    tables = pd.read_html(
        "https://example.com/data.html",
        attrs={"class": "data-table"}
    )

    # attrs narrows the search, but the result is still a list
    df = tables[0]
    print(df)

Practical Example: Scraping Stock Data from a Web Page
The following example demonstrates how to scrape stock list data from a financial website:

    import pandas as pd
    from io import StringIO

    # Create simulated HTML content for demonstration
    # Replace with a real URL in actual usage
    html_content = '''
    <table>
      <tr>
        <th>Stock Code</th>
        <th>Stock Name</th>
        <th>Close Price</th>
        <th>Change %</th>
      </tr>
      <tr>
        <td>600519</td>
        <td>Kweichow Moutai</td>
        <td>1850.00</td>
        <td>2.35%</td>
      </tr>
      <tr>
        <td>000858</td>
        <td>Wuliangye</td>
        <td>185.50</td>
        <td>-1.20%</td>
      </tr>
      <tr>
        <td>601318</td>
        <td>China Ping An</td>
        <td>52.30</td>
        <td>0.85%</td>
      </tr>
    </table>
    '''

    # Read the table from a string.
    # Keep stock codes as text so that leading zeros (000858) survive.
    tables = pd.read_html(StringIO(html_content), converters={"Stock Code": str})

    df = tables[0]

    print("Original Table:")
    print(df)
    print()

    # Data cleaning: strip the percent sign and convert to float
    df["Change %"] = df["Change %"].str.replace("%", "").astype(float)
    df["Close Price"] = df["Close Price"].astype(float)

    print("Cleaned Data:")
    print(df)

StringIO wraps the string in a file-like object, so pandas reads it as a stream instead of mistakenly interpreting it as a path or URL. Passing a literal HTML string directly is deprecated since pandas 2.1.
Output:

    Original Table:
       Stock Code       Stock Name  Close Price  Change %
    0      600519  Kweichow Moutai       1850.0     2.35%
    1      000858        Wuliangye        185.5    -1.20%
    2      601318    China Ping An         52.3     0.85%

    Cleaned Data:
       Stock Code       Stock Name  Close Price  Change %
    0      600519  Kweichow Moutai       1850.0      2.35
    1      000858        Wuliangye        185.5     -1.20
    2      601318    China Ping An         52.3      0.85

Reading Wikipedia Tables

    import pandas as pd

    # Read tables from a Wikipedia page
    # Example: list of countries by population
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

    try:
        tables = pd.read_html(url)
        print(f"This page contains {len(tables)} tables")

        # Usually, the first table is what we need
        df = tables[0]
        print(df.head(10))
    except Exception as e:
        print(f"Failed to read: {e}")
        print("Make sure required libraries are installed: pip install lxml beautifulsoup4 html5lib")

Data Cleaning and Processing
Tables read from web pages often require further cleaning before analysis.

Common Cleaning Steps

    import pandas as pd
    from io import StringIO

    # Simulate a table with messy data
    html_content = '''
    <table>
      <tr><th>Date</th><th>Product</th><th>Sales</th><th>Remarks</th></tr>
      <tr><td>2024-01-01</td><td>Product A</td><td>1,234</td><td>N/A</td></tr>
      <tr><td>2024-01-02</td><td>Product B</td><td>567</td><td>Out of stock</td></tr>
      <tr><td>2024-01-03</td><td>Product C</td><td>--</td><td>Not counted</td></tr>
    </table>
    '''

    # Turn off automatic thousands and NA handling so the raw text is kept
    # and each cleaning step below is visible
    df = pd.read_html(
        StringIO(html_content), thousands=None, keep_default_na=False
    )[0]
    print("Original Data:")
    print(df)
    print()

    # Cleaning steps
    # 1. Rename columns
    df.columns = ["Date", "Product", "Sales", "Remarks"]

    # 2. Handle missing values
    df = df.replace(["N/A", "--", "Not counted", "Out of stock"], pd.NA)

    # 3. Clean the numeric column (remove commas, then convert)
    df["Sales"] = df["Sales"].str.replace(",", "")
    df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

    # 4. Convert dates
    df["Date"] = pd.to_datetime(df["Date"])

    print("Cleaned Data:")
    print(df)
    print("\nData Types:")
    print(df.dtypes)

Output:

    Original Data:
             Date    Product  Sales       Remarks
    0  2024-01-01  Product A  1,234           N/A
    1  2024-01-02  Product B    567  Out of stock
    2  2024-01-03  Product C     --   Not counted

    Cleaned Data:
             Date    Product   Sales  Remarks
    0  2024-01-01  Product A  1234.0     <NA>
    1  2024-01-02  Product B   567.0     <NA>
    2  2024-01-03  Product C     NaN     <NA>

    Data Types:
    Date       datetime64[ns]
    Product            object
    Sales             float64
    Remarks            object
    dtype: object

Notes and Common Issues

- Install Dependencies
  read_html() requires an HTML parser. lxml is the default, and beautifulsoup4 with html5lib is used as a fallback:

      pip install lxml beautifulsoup4 html5lib

- Network Issues
  Reading from a network URL may run into latency or access restrictions. read_html() has no timeout argument, so consider downloading the page yourself with a timeout, or caching a local copy.

- Complex Table Structures
  Some web tables use complex structures such as merged cells or nested tables, which may be parsed incorrectly or fail to parse. In such cases, try parsing the HTML directly with BeautifulSoup.

- Data Integrity
  Web data may change over time. Always verify the data after reading and compare it with the original source.
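
The timeout and caching advice above can be sketched as follows. This is one possible approach, not a pandas feature: `read_tables_cached`, its parameters, and the User-Agent string are hypothetical names chosen for illustration. The page is downloaded with the standard library so that a timeout applies, saved to a local file, and parsed from that file on later calls:

```python
from io import StringIO
from pathlib import Path
from urllib.request import Request, urlopen

import pandas as pd


def read_tables_cached(url, cache_path, timeout=10):
    """Return the tables at `url`, reusing a local copy when one exists.

    Hypothetical helper for illustration; not part of pandas.
    """
    cache = Path(cache_path)
    if cache.exists():
        # Cache hit: no network request is made
        html = cache.read_text(encoding="utf-8")
    else:
        # Fetch manually so that a timeout (in seconds) can be applied
        request = Request(url, headers={"User-Agent": "my-table-reader/0.1"})
        with urlopen(request, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
        cache.write_text(html, encoding="utf-8")

    # Wrap the text in StringIO so pandas treats it as a file-like object
    return pd.read_html(StringIO(html))
```

A call such as `read_tables_cached("https://example.com/data.html", "data.html")` would hit the network only the first time. Deleting the cache file forces a fresh download, which is one simple way to pick up changes on the source page.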
When scraping web data, please comply with the website's robots.txt rules and applicable laws and regulations. Avoid frequent requests to prevent server overload.
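
A robots.txt check can be automated with Python's standard library. The sketch below assumes a made-up robots.txt body and a hypothetical user agent name (`my-table-reader`), and parses the rules from a string so it runs offline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
# For a real site, call parser.set_url(".../robots.txt") and parser.read()
# instead of parser.parse().
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch reports whether the given user agent may request the URL
allowed = parser.can_fetch("my-table-reader", "https://example.com/data.html")
blocked = parser.can_fetch("my-table-reader", "https://example.com/private/report.html")

# crawl_delay returns the requested pause between requests, if any
delay = parser.crawl_delay("my-table-reader")

print(allowed, blocked, delay)
```

When a crawl delay is declared, sleeping for that many seconds between calls to read_html() is a simple way to avoid overloading the server.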
Summary
pd.read_html() is a powerful tool for scraping tabular data from web pages, quickly converting HTML tables into DataFrames. In practice, ensure dependencies are installed, handle network issues, and clean messy data. Scraped data typically requires further processing before analysis, so combine it with pandas' data cleaning capabilities.