YouTip LogoYouTip

Pandas Cleaning

Data cleaning is the process of processing some useless data. Many datasets have data missing, data format errors, wrong data or duplicate data. If you want to make data analysis more accurate, you need to process this useless data. Common steps for data cleaning and preprocessing: 1. **Missing Value Processing**: Identify and fill missing values, or delete rows/columns containing missing values. 2. **Duplicate Data Processing**: Check and delete duplicate data to ensure each piece of data is unique. 3. **Outlier Processing**: Identify and process outliers, such as extreme values or wrong values. 4. **Data Format Conversion**: Convert data types or perform unit conversions, such as date format conversion. 5. **Standardization and Normalization**: Standardize (such as Z-score) or normalize (such as Min-Max) numeric data. 6. **Categorical Data Encoding**: Convert categorical variables to numeric form. Common methods include One-Hot Encoding and Label Encoding. 7. **Text Processing**: Clean text data, such as removing stop words, stemming, tokenization, etc. 8. **Data Sampling**: Sample from a dataset, or handle class imbalance through oversampling/undersampling. 9. **Feature Engineering**: Create new features, remove irrelevant features, select important features, etc. In this tutorial, we will use the Pandas package for data cleaning. The test data used in this article [property-data.csv](https://static.jyshare.com/download/property-data.csv) is as follows: !(#) The above table contains four types of empty data: * n/a * NA * β€” * na * * * ## Pandas Cleaning Empty Values If we want to delete rows containing empty fields, we can use the **dropna()** method. The syntax is as follows: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) **Parameter Description:** * axis: Default is **0**, which means to remove the entire row when encountering empty values. If parameter **axis=1** is set, it means to remove the entire column when encountering empty values. * how: Default is **'any'**. If any data in a row (or column) has NA, the entire row is removed. If **how='all'** is set, the row (or column) is removed only if all are NA. * thresh: Set how many non-empty values are required to keep the data. * subset: Set which columns to check. If multiple columns, you can use a list of column names as parameters. * inplace: If set to True, the calculated value directly overwrites the previous value and returns None, modifying the source data. We can use isnull() to determine whether each cell is empty. ## Example import pandas as pd df = pd.read_csv('property-data.csv') print(df['NUM_BEDROOMS']) print(df['NUM_BEDROOMS'].isnull()) The output of the above example is as follows: !(#) In the above example, we see that Pandas treats n/a and NA as empty data, but na is not empty data, which does not meet our requirements. We can specify the empty data type: ## Example import pandas as pd missing_values =["n/a","na","--"] df = pd.read_csv('property-data.csv', na_values = missing_values) print(df['NUM_BEDROOMS']) print(df['NUM_BEDROOMS'].isnull()) The output of the above example is as follows: !(#) The following example demonstrates deleting rows containing empty data. ## Example import pandas as pd df = pd.read_csv('property-data.csv') new_df = df.dropna() print(new_df.to_string()) The output of the above example is as follows: !(#) **Note:** By default, the dropna() method returns a new DataFrame and does not modify the source data. If you want to modify the source DataFrame, you can use the **inplace = True** parameter: ## Example import pandas as pd df = pd.read_csv('property-data.csv') df.dropna(inplace =True) print(df.to_string()) The output of the above example is as follows: !(#) We can also remove rows with empty values in specified columns: ## Example Remove rows where the ST_NUM column field value is empty: import pandas as pd df = pd.read_csv('property-data.csv') df.dropna(subset=['ST_NUM'], inplace =True) print(df.to_string()) The output of the above example is as follows: !(#) We can also use the **fillna()** method to replace some empty fields: ## Example Use 12345 to replace empty fields: import pandas as pd df = pd.read_csv('property-data.csv') df.fillna(12345, inplace =True) print(df.to_string()) The output of the above example is as follows: !(#) We can also specify a particular column to replace data: ## Example Use 12345 to replace empty data for PID: import pandas as pd df = pd.read_csv('property-data.csv') df['PID'].fillna(12345, inplace =True) print(df.to_string()) The output of the above example is as follows: !(#) A common method to replace empty cells is to calculate the mean, median, or mode of the column. Pandas uses the mean(), median(), and mode() methods to calculate the mean (average of all values), median (the middle number after sorting), and mode (the most frequent number) of a column. ## Example Use the mean() method to calculate the mean of a column and replace empty cells: import pandas as pd df = pd.read_csv('property-data.csv') x = df.mean() df.fillna(x, inplace =True) print(df.to_string()) The output of the above example is as follows. The red box shows the mean calculated to replace empty cells: !(#) ## Example Use the median() method to calculate the median of a column and replace empty cells: import pandas as pd df = pd.read_csv('property-data.csv') x = df.median() df.fillna(x, inplace =True) print(df.to_string()) The output of the above example is as follows. The red box shows the median calculated to replace empty cells: !(#) ## Example Use the mode() method to calculate the mode of a column and replace empty cells: import pandas as pd df = pd.read_csv('property-data.csv') x = df.mode() df.fillna(x, inplace =True) print(df.to_string()) The output of the above example is as follows. The red box shows the mode calculated to replace empty cells: !(
← Matplotlib TutorialFunc Array Key First β†’