Pandas Missing Value Handling

Real-world data often contains missing values (NaN). Pandas provides a rich set of functions for handling them. This section covers the dropna, fillna, and interpolate methods.

Representation of Missing Values

In Pandas, missing values are represented by NaN (Not a Number), a special floating-point value that NumPy exposes as np.nan.

Example

    import pandas as pd
    import numpy as np

    # Create a Series containing a missing value
    s = pd.Series([1, 2, np.nan, 4, 5])
    print("Series containing NaN:")
    print(s)
    print(f"Has missing values: {s.isna().any()}")
    print()

    # Missing values in a DataFrame
    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4],
        "B": [np.nan, 2, 3, 4],
        "C": [1, 2, 3, np.nan]
    })
    print("DataFrame containing missing values:")
    print(df)
    print()

    # Detect missing values (True marks a missing cell)
    print("Positions of missing values:")
    print(df.isna())
    print()

    # Count missing values per column
    print("Number of missing values per column:")
    print(df.isna().sum())

In Pandas, NaN, None, and pandas.NA are all recognized as missing values. Use isna() or its alias isnull() to detect all of them uniformly.
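
The claim that NaN, None, and pandas.NA are all detected the same way can be checked with a minimal sketch. The Series below is hypothetical sample data, and the object dtype is chosen only so that the three markers are stored as given.

```python
import numpy as np
import pandas as pd

# An object-dtype Series holding three different missing-value markers
s = pd.Series([1, np.nan, None, pd.NA], dtype="object")

mask = s.isna()
print(mask.tolist())            # [False, True, True, True]

# isnull() is an alias of isna(), so both give the same mask
print(mask.equals(s.isnull()))  # True

# In a float Series, None is stored as NaN
print(pd.Series([1.0, None]).dtype)  # float64
```

Because both methods return the same boolean mask, either one can feed sum() or any() as in the example above.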

dropna: Remove Missing Values

Remove Rows

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4],
        "B": [1, np.nan, 3, 4],
        "C": [1, 2, 3, np.nan]
    })
    print("Original data:")
    print(df)
    print()

    # Remove rows containing any missing value (default: how="any")
    print("Remove rows with missing values:")
    print(df.dropna())
    print()

    # how="all": only remove rows where every value is missing
    # (no row here is entirely missing, so nothing is dropped)
    print("Only remove rows where all values are missing:")
    print(df.dropna(how="all"))
    print()

    # thresh: retain rows with at least N non-missing values
    # (every row here has at least 2, so all rows are kept)
    print("Retain rows with at least 2 non-missing values:")
    print(df.dropna(thresh=2))

Remove Columns

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4],
        "B": [np.nan, np.nan, np.nan, np.nan],  # all missing
        "C": [1, 2, 3, 4]
    })
    print("Original data:")
    print(df)
    print()

    # Remove columns containing any missing value (drops A and B)
    print("Remove columns with missing values:")
    print(df.dropna(axis=1))
    print()

    # how="all": only remove columns where every value is missing (drops B)
    print("Remove columns where all values are missing:")
    print(df.dropna(axis=1, how="all"))
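
In the row-removal sample data, no row is entirely missing, so how="all" and thresh=2 return every row and their effect is hard to see. The sketch below uses a hypothetical DataFrame, chosen only for illustration, in which the options give different results.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 2 has one valid value
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 4],
    "B": [1, np.nan, 3, 4],
    "C": [1, np.nan, np.nan, np.nan],
})

drop_any = df.dropna()            # drop rows with at least one NaN
drop_all = df.dropna(how="all")   # drop rows where every value is NaN
keep_two = df.dropna(thresh=2)    # keep rows with >= 2 non-missing values

print(drop_any.index.tolist())  # [0]
print(drop_all.index.tolist())  # [0, 2, 3]
print(keep_two.index.tolist())  # [0, 3]

# The same options apply to columns: keep columns with >= 3 non-missing values
print(df.dropna(axis=1, thresh=3).columns.tolist())  # ['B']
```

Comparing the surviving index labels is a quick way to confirm which rows a given option removes before applying it to real data.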

fillna: Fill Missing Values

Fill with Fixed Values

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4, 5],
        "B": [np.nan, 2, 3, np.nan, 5],
        "C": [1, 2, 3, 4, np.nan]
    })
    print("Original data:")
    print(df)
    print()

    # Fill every missing value with 0
    print("Fill with 0:")
    print(df.fillna(0))
    print()

    # Fill different columns with different values
    print("Fill different columns with different values:")
    print(df.fillna({"A": 0, "B": 99, "C": -1}))
    print()

    # Forward fill (use the previous value); a leading NaN stays missing.
    # df.ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas versions
    print("Forward fill:")
    print(df.ffill())
    print()

    # Backward fill (use the next value); a trailing NaN stays missing
    print("Backward fill:")
    print(df.bfill())

Fill with Statistical Values

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4, 5, 6],
        "B": [10, np.nan, 30, np.nan, 50, 60]
    })
    print("Original data:")
    print(df)
    print()

    # Fill each column with its own mean
    print("Fill with mean:")
    print(df.fillna(df.mean()))
    print()

    # Fill each column with its own median
    print("Fill with median:")
    print(df.fillna(df.median()))
    print()

    # Fill each column with a different strategy
    # (use the scalar df["A"].mean(); df.mean() returns a Series covering all columns)
    print("Fill column A with mean, column B with 0:")
    print(df.fillna({"A": df["A"].mean(), "B": 0}))

Forward and backward filling are especially useful for time series data, because they carry the nearest known observation across a gap and so preserve continuity. Which one to use depends on the business meaning of the data.
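
To make the time series case concrete, here is a minimal sketch on a hypothetical daily series (the dates and readings are invented for illustration). It also uses the limit parameter of ffill, which this section does not otherwise cover, to cap how far a value is carried.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a two-day gap and a missing last day
idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan], index=idx)

# Forward fill: carry the last known value into the gap
print(s.ffill().tolist())         # [10.0, 10.0, 10.0, 40.0, 40.0]

# Backward fill: pull the next known value back into the gap;
# the trailing NaN has no later value, so it stays missing
print(s.bfill().tolist())         # [10.0, 40.0, 40.0, 40.0, nan]

# limit caps how many consecutive NaNs are filled in each gap
print(s.ffill(limit=1).tolist())  # [10.0, 10.0, nan, 40.0, 40.0]
```

A limit is worth considering when a stale value should not be carried across a long outage.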

interpolate: Interpolation-based Filling

Interpolation estimates each missing value from its neighboring data points, which often fits ordered numeric data better than a constant fill.

Linear Interpolation

Example

    import pandas as pd
    import numpy as np

    s = pd.Series([1, 2, np.nan, 4, 5, np.nan, 7])
    print("Original Series:")
    print(s)
    print()

    # Linear interpolation (the default method)
    print("Linear interpolation:")
    print(s.interpolate())
    print()

    # Time-weighted interpolation (requires a DatetimeIndex)
    print("Time-weighted interpolation (for time series):")
    s2 = pd.Series([1, np.nan, np.nan, 4], index=pd.date_range("2024-01-01", periods=4, freq="D"))
    print(s2.interpolate(method="time"))
    print()

    # Nearest-neighbor interpolation (this method requires SciPy to be installed)
    print("Fill with nearest value:")
    print(s.interpolate(method="nearest"))

Interpolation for DataFrame

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4, 5],
        "B": [10, np.nan, 30, 40, np.nan]
    })
    print("Original data:")
    print(df)
    print()

    # Interpolate directly on the DataFrame (each column is handled separately)
    df_interpolated = df.interpolate(method="linear")
    print("After linear interpolation:")
    print(df_interpolated)
    print()

    # limit: fill at most N consecutive missing values in each gap
    # (each gap here is one value long, so the result matches the unlimited case)
    print("Fill at most 1 consecutive missing value per gap:")
    print(df.interpolate(limit=1))
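
The effect of limit is easier to see on a gap longer than one value. The following sketch uses a hypothetical Series with three consecutive missing values. The limit_direction parameter shown in the last call is not covered elsewhere in this section and is included only as an illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# No limit: the whole gap is filled along the straight line from 1 to 5
full = s.interpolate().tolist()
print(full)      # [1.0, 2.0, 3.0, 4.0, 5.0]

# limit=1: only the first NaN of the gap is filled (default direction is forward)
forward = s.interpolate(limit=1).tolist()
print(forward)   # [1.0, 2.0, nan, nan, 5.0]

# limit_direction="both": fill one NaN from each side of the gap
both = s.interpolate(limit=1, limit_direction="both").tolist()
print(both)      # [1.0, 2.0, nan, 4.0, 5.0]
```

The filled values still lie on the line through the two valid endpoints; limit only controls how many of them are written back.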

Practical Example: Handling Real Data

Example

    import pandas as pd
    import numpy as np

    # Simulate real business data
    np.random.seed(42)
    n = 20
    df = pd.DataFrame({
        "Date": pd.date_range("2024-01-01", periods=n),
        "Sales Revenue": np.random.choice([100, 200, np.nan, 300, np.nan], n),
        "Customer Count": np.random.choice([10, np.nan, 20, 30], n),
        "Conversion Rate": np.random.choice([0.05, 0.1, np.nan, 0.15], n)
    })
    print("Missing value counts in original data:")
    print(df.isna().sum())
    print()

    # Handling strategy (each step targets one column):
    # 1. Fill sales revenue by interpolation (the business allows fluctuations);
    #    limit_direction="both" also covers gaps at the start of the column
    df["Sales Revenue"] = df["Sales Revenue"].interpolate(limit_direction="both")
    # 2. Fill customer count with the column mean
    df["Customer Count"] = df["Customer Count"].fillna(df["Customer Count"].mean())
    # 3. Fill conversion rate with 0 (indicating no conversion)
    df["Conversion Rate"] = df["Conversion Rate"].fillna(0)

    print("Data after handling:")
    print(df)
    print()
    print("Missing value counts after handling:")
    print(df.isna().sum())

Frequently Asked Questions

1. Does fillna create a new object or modify in place?
By default, fillna returns a new object and leaves the original unchanged. Assign the result back (df = df.fillna(0)), or pass inplace=True to modify in place.

2. Why do missing values remain after interpolation?
With the default settings, interpolate only fills forward: missing values at the beginning of a sequence stay missing, while those at the end are filled with the last valid value. Pass limit_direction="both", or handle the leading values separately with bfill.

3. How are NaN and empty strings distinguished?
Empty strings "" are not treated as missing values; convert them first with replace("", np.nan).
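
The three answers above can be checked with a short sketch. All Series below are hypothetical sample data. The second part assumes the default linear method, under which pandas leaves leading gaps missing and fills trailing gaps with the last valid value.

```python
import numpy as np
import pandas as pd

# FAQ 1: fillna returns a new object; the original keeps its NaN
s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0)
print(int(s.isna().sum()), int(filled.isna().sum()))  # 1 0

# FAQ 2: default interpolation leaves the leading NaN untouched
t = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
default_fill = t.interpolate().tolist()
both_fill = t.interpolate(limit_direction="both").tolist()
print(default_fill)  # [nan, 1.0, 2.0, 3.0, 3.0]
print(both_fill)     # [1.0, 1.0, 2.0, 3.0, 3.0]

# FAQ 3: empty strings are not missing until converted to NaN
names = pd.Series(["Ann", "", "Bob"])
print(names.isna().tolist())                      # [False, False, False]
print(names.replace("", np.nan).isna().tolist())  # [False, True, False]
```

Running isna().sum() before and after each step, as in the practical example, is a simple way to confirm that no unexpected gaps remain.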
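
When choosing a filling method, consider the business context: forward filling suits time series in which the last observation stays valid until it is updated, backward filling suits cases where the next known value applies retroactively, mean or median filling suits numerical columns without a meaningful order, and interpolation suits ordered, continuous data.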