YouTip LogoYouTip

Pandas Functions

The following lists some common functions in Pandas and their usage examples: ## Reading Data | Function | Description | | --- | --- | | pd.read_csv(filename) | Read CSV files; | | pd.read_excel(filename) | Read Excel files; | | pd.read_sql(query, connection_object) | Read data from SQL database; | | pd.read_json(json_string) | Read data from a JSON string; | | pd.read_html(url) | Read data from an HTML page. | ## Examples import pandas as pd # Read data from a CSV file df = pd.read_csv('data.csv') # Read data from an Excel file df = pd.read_excel('data.xlsx') # Read data from a SQL database import sqlite3 conn = sqlite3.connect('database.db') df = pd.read_sql('SELECT * FROM table_name', conn) # Read data from a JSON string json_string ='{"name": "John", "age": 30, "city": "New York"}' df = pd.read_json(json_string) # Read data from an HTML page url ='' dfs = pd.read_html(url) df = dfs# Select the first dataframe * * * ## Viewing Data | Function | Description | | --- | --- | | df.head(n) | Display the first n rows of data; | | df.tail(n) | Display the last n rows of data; | | df.info() | Display information about the data, including column names, data types, missing values, etc.; | | df.describe() | Display basic statistical information about the data, including mean, variance, maximum, minimum, etc.; | | df.shape | Display the number of rows and columns in the data. | ## Examples # Display the first five rows of data df.head() # Display the last five rows of data df.tail() # Display data information df.info() # Display basic statistical information df.describe() # Display the number of rows and columns in the data df.shape ## Example import pandas as pd data =[ {"name": "Google","likes": 25,"url": "https://www.google.com"}, {"name": "","likes": 30,"url": ""}, {"name": "Taobao","likes": 35,"url": "https://www.taobao.com"} ] df = pd.DataFrame(data) # Display the first two rows of data print(df.head(2)) # Display the last row of data print(df.tail(1)) The output of the above example is: name likes url 0 Google 25 https://www.google.com1 30 name likes url 2 Taobao 35 https://www.taobao.com * * * ## Data Cleaning | Function | Description | | --- | --- | | df.dropna() | Remove rows or columns containing missing values; | | df.fillna(value) | Replace missing values with a specified value; | | df.replace(old_value, new_value) | Replace a specified value with a new value; | | df.duplicated() | Check for duplicate data; | | df.drop_duplicates() | Remove duplicate data. | ## Examples # Remove rows or columns containing missing values df.dropna() # Replace missing values with a specified value df.fillna(0) # Replace a specified value with a new value df.replace('old_value','new_value') # Check for duplicate data df.duplicated() # Remove duplicate data df.drop_duplicates() * * * ## Data Selection and Slicing | Function | Description | | --- | --- | | df | Select a specified column; | | df.loc[row_index, column_name] | Select data by label; | | df.iloc[row_index, column_index] | Select data by position; | | df.ix[row_index, column_name] | Select data by label or position; | | df.filter(items=[column_name1, column_name2]) | Select specified columns; | | df.filter(regex='regex') | Select columns whose names match a regular expression; | | df.sample(n) | Randomly select n rows of data. | ## Examples # Select a specified column df['column_name'] # Select data by label df.loc[row_index, column_name] # Select data by position df.iloc[row_index, column_index] # Select data by label or position df.ix[row_index, column_name] # Select specified columns df.filter(items=['column_name1','column_name2']) # Select columns whose names match a regular expression df.filter(regex='regex') # Randomly select n rows of data df.sample(n=5) * * * ## Sorting Data | Function | Description | | --- | --- | | df.sort_values(column_name) | Sort by the values in a specified column; | | df.sort_values([column_name1, column_name2], ascending=[True, False]) | Sort by multiple column values; | | df.sort_index() | Sort by index. | ## Examples # Sort by the values in a specified column df.sort_values('column_name') # Sort by multiple column values df.sort_values(['column_name1','column_name2'], ascending=[True,False]) # Sort by index df.sort_index() * * * ## Grouping and Aggregation | Function | Description | | --- | --- | | df.groupby(column_name) | Group by a specified column; | | df.aggregate(function_name) | Perform aggregation operations on grouped data; | | df.pivot_table(values, index, columns, aggfunc) | Generate a pivot table. | ## Examples # Group by a specified column df.groupby('column_name') # Perform aggregation operations on grouped data df.aggregate('function_name') # Generate a pivot table df.pivot_table(values='value', index='index_column', columns='column_name', aggfunc='function_name') * * * ## Merging Data | Function | Description | | --- | --- | | pd.concat([df1, df2]) | Concatenate multiple dataframes along rows or columns; | | pd.merge(df1, df2, on=column_name) | Merge two dataframes based on a specified column. | ## Examples # Concatenate multiple dataframes along rows or columns df = pd.concat([df1, df2]) # Merge two dataframes based on a specified column df = pd.merge(df1, df2, on='column_name') * * * ## Data Selection and Filtering | Function | Description | | --- | --- | | df.loc[row_indexer, column_indexer] | Select rows and columns by label. | | df.iloc[row_indexer, column_indexer] | Select rows and columns by position. | | df[df['column_name'] > value] | Select rows where the column meets a condition. | | df.query('column_name > value') | Use a string expression to select rows where the column meets a condition. | * * * ## Data Statistics and Description | Function | Description | | --- | --- | | df.describe() | Calculate basic statistics such as mean, standard deviation, minimum, maximum, etc. | | df.mean() | Calculate the mean of each column. | | df.median() | Calculate the median of each column. | | df.mode() | Calculate the mode of each column. | | df.count() | Count the number of non-null values in each column. | * * * ## Example Suppose we have the following JSON data, saved in a file called data.json: ## data.json File [ { "name":"Alice", "age":25, "gender":"female", "score":80 }, { "name":"Bob", "age":null, "gender":"male", "score":90 }, { "name":"Charlie", "age":30, "gender":"male", "score":null }, { "name":"David", "age":35, "gender":"male", "score":70 } ] We can use Pandas to read the JSON data and perform data cleaning and processing, data selection and filtering, data statistics and description, etc., as follows: ## Example import pandas as pd # Read JSON data df = pd.read_json('data.json') # Drop missing values df = df.dropna() # Fill missing values with specified values df = df.fillna({'age': 0,'score': 0}) # Rename columns df = df.rename(columns={'name': 'Name','age': 'Age','gender': 'Gender','score': 'Score'}) # Sort by score df = df.sort_values(by='Score', ascending=False) # Group by gender and calculate average age and score grouped = df.groupby('Gender').agg({'Age': 'mean','Score': 'mean'}) # Select rows where score is greater than or equal to 90, keeping only name and score columns df = df.loc[df['Score']>=90,['Name','Score']] # Calculate basic statistics for each column stats = df.describe() # Calculate the mean of each column mean = df.mean() # Calculate the median of each column median = df.median() # Calculate the mode of each column mode = df.mode() # Count the number of non-null values in each column count = df.count() The output results are as follows: # df Name Age Gender Score1 Bob 0 male 90# grouped Age ScoreGender female 25.000000 80 male 27.500000 80# stats Score count 1.0 mean 90.0 std NaN min 90.025% 90.050% 90.075% 90.0 max 90.0# meanScore 90.0 dtype: float64 # medianScore 90.0 dtype: float64 # mode Name Score0 Bob 90.0# countName 1Score 1 dtype: int64
← Jsref At ArrayHttp2 Tutorial β†’