Skills Project Data
This project builds a data cleaning Skill that automatically processes user-uploaded CSV/Excel data, completing cleaning, statistics, and report generation.
Goal: Master pandas data processing pipeline and multi-step Skill design.
* * *
## Project Features Overview
| Step | Operation | Output |
| --- | --- | --- |
| 1. Load Data | Supports CSV, Excel | Row count, column names, data types |
| 2. Quality Check | Null values, duplicate rows, outlier scanning | Quality issue report |
| 3. Data Cleaning | Deduplication, fill null values, standardize column names | Cleaned CSV |
| 4. Statistical Analysis | Mean, median, distribution | Statistical summary table |
| 5. Generate Report | Summarize above results | Excel report file |
* * *
## Directory Structure
data-cleaner/βββ SKILL.md βββ scripts/ βββ requirements.txt βββ load_data.py βββ quality_check.py βββ clean_data.py βββ calc_stats.py βββ gen_report.py
* * *
## Step 1: Load Data
## Example
# File path: scripts/load_data.py
import pandas as pd
import sys, json,os
def load(file_path: str) ->dict:
"""Load CSV or Excel file, return basic information"""
if not os.path.exists(file_path):
return{"status": "error","message": f"File does not exist: {file_path}"}
ext =os.path.splitext(file_path).lower()
try:
if ext ==".csv":
df = pd.read_csv(file_path, encoding="utf-8")
elif ext in(".xlsx",".xls"):
df = pd.read_excel(file_path)
else:
return{"status": "error",
"message": f"Unsupported format: {ext}, please use .csv or .xlsx"}
except UnicodeDecodeError:
# Try GBK encoding (common for CSV exported from Chinese Windows)
df = pd.read_csv(file_path, encoding="gbk")
# Save df to temporary file for subsequent steps
tmp_path ="/home/claude/loaded_data.csv"
df.to_csv(tmp_path, index=False, encoding="utf-8")
return{
"status": "success",
"rows": len(df),
"cols": len(df.columns),
"columns": list(df.columns),
"dtypes": df.dtypes.astype(str).to_dict(),
"tmp_path": tmp_path
}
if __name__ =="__main__":
print(json.dumps(load(sys.argvif len(sys.argv)>1 else""),
ensure_ascii=False, indent=2))
* * *
## Step 2: Quality Check
## Example
# File path: scripts/quality_check.py
import pandas as pd
import sys, json
def check(file_path: str) ->dict:
"""Scan for data quality issues"""
df = pd.read_csv(file_path)
# Null value statistics
null_counts = df.isnull().sum()
null_issues =[
{"column": col,"null_count": int(cnt),
"null_pct": round(cnt / len(df) * 100,1)}
for col, cnt in null_counts.items()if cnt >0
]
# Duplicate rows
dup_count =int(df.duplicated().sum())
# Outlier detection (IQR method for numeric columns)
outlier_issues =[]
for col in df.select_dtypes(include="number").columns:
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df< lower) | (df> upper)]
if len(outliers)>0:
outlier_issues.append({
"column": col,
"count": len(outliers),
"range": f"[{lower:.2f}, {upper:.2f}]"
})
issues_count =len(null_issues) + (1 if dup_count >0 else 0) + len(outlier_issues)
return{
"status": "success",
"total_issues": issues_count,
"duplicate_rows": dup_count,
"null_columns": null_issues,
"outlier_columns": outlier_issues
}
if __name__ =="__main__":
print(json.dumps(check(sys.argvif len(sys.argv)>1 else""),
ensure_ascii=False, indent=2))
* * *
## Step 3: Data Cleaning
## Example
# File path: scripts/clean_data.py
import pandas as pd
import sys, json,os
def clean(file_path: str, output_path: str="/home/claude/cleaned_data.csv") ->dict:
"""Execute standard cleaning process"""
df = pd.read_csv(file_path)
stats ={}
# 1. Standardize column names (remove spaces, lowercase, replace spaces with underscores)
original_cols =list(df.columns)
df.columns=[c.strip().lower().replace(" ","_")for c in df.columns]
stats=sum(a != b for a, b in zip(original_cols, df.columns))
# 2. Delete completely duplicate rows
before =len(df)
df.drop_duplicates(inplace=True)
stats= before - len(df)
# 3. Fill null values in numeric columns with 0, string columns with empty string
null_before =int(df.isnull().sum().sum())
for col in df.columns:
if df.dtype in("int64","float64"):
df.fillna(0, inplace=True)
else:
df.fillna("", inplace=True)
stats= null_before - int(df.isnull().sum().sum())
# 4. Remove leading and trailing spaces from strings
for col in df.select_dtypes(include="object").columns:
df= df.str.strip()
df.to_csv(output_path, index=False, encoding="utf-8")
return{
"status": "success",
"output": output_path,
"rows_after": len(df),
"cols_renamed": stats,
"dup_removed": stats,
"nulls_filled": stats
}
if __name__ =="__main__":
print(json.dumps(clean(sys.argvif len(sys.argv)>1 else""),
ensure_ascii=False, indent=2))
* * *
## Complete SKILL.md Content
--- name: data-cleaner version: 1.0.0 description: > Clean and analyze user-uploaded CSV/Excel data, including quality check, deduplication, null value processing, statistical analysis, and report generation. Triggered when user needs data cleaning, data preprocessing, statistical analysis, or data quality report.---# Data Cleaning and Analysis Assistant## Execution Process### Step 1: Load Data```bash cd scripts/ python load_data.py ```Inform user about data scale (row count, column count) and column names.### Step 2: Quality Check```bash python quality_check.py /home/claude/loaded_data.csv ```Display quality report: which columns have null values, how many duplicate rows, which columns have outliers. Ask user: Clean according to default rules? Or need custom processing method?### Step 3: Execute CleaningAfter user confirms, execute:```bash python clean_data.py /home/claude/loaded_data.csv ```Inform cleaning results: how many duplicate rows removed, how many null values filled.### Step 4: Output Cleaned FileCopy /home/claude/cleaned_data.csv to output directory, then call present_files to display.
YouTip