Pandas Missing Data

Pandas Missing Value Handling

Real-world data often contains missing values (NaN). Pandas provides a rich set of functions to handle missing data. This section details the usage of methods such as fillna, dropna, and interpolate.

Representation of Missing Values

In Pandas, missing values are most commonly represented by NaN (Not a Number), the IEEE 754 floating-point value that NumPy exposes as np.nan.

Example

import pandas as pd
import numpy as np

# Create data containing missing values
s = pd.Series([1, 2, np.nan, 4, 5])
print("Series containing NaN:")
print(s)
print(f"Has missing values: {s.isna().any()}")
print()

# Missing values in DataFrame
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [np.nan, 2, 3, 4],
    "C": [1, 2, 3, np.nan]
})
print("DataFrame containing missing values:")
print(df)
print()

# Detect missing values
print("Positions of missing values:")
print(df.isna())
print()

# Count missing values per column
print("Number of missing values per column:")
print(df.isna().sum())

In Pandas, NaN, None, pandas.NA, and NaT (for datetimes) are all recognized as missing values. Use isna() or its alias isnull() to detect all of them uniformly.
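The uniform detection described above can be checked with a small sketch. The Series name `mixed` and its contents are made up for illustration; only isna(), isnull(), and notna() are Pandas API.

```python
import numpy as np
import pandas as pd

# An object-dtype Series mixing the different missing-value markers
# (hypothetical data, chosen only to exercise each marker once)
mixed = pd.Series([1, None, np.nan, pd.NA, pd.NaT, "text"], dtype="object")

# isna() flags every marker; isnull() is an alias and returns the same mask
mask = mixed.isna()
print(mask.tolist())
print(mask.equals(mixed.isnull()))

# notna() is the logical inverse and keeps only the real values
present = mixed[mixed.notna()]
print(present.tolist())
```

Only the integer and the string survive the notna() filter; the four markers in between are all treated as missing.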

dropna: Remove Missing Values

Remove Rows

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [1, np.nan, 3, 4],
    "C": [1, 2, 3, np.nan]
})
print("Original data:")
print(df)
print()

# Remove rows containing missing values (default behavior)
print("Remove rows with missing values:")
print(df.dropna())
print()

# how='all': only remove rows where all values are missing
print("Only remove rows where all values are missing:")
print(df.dropna(how="all"))
print()

# thresh: retain rows with at least N non-missing values
print("Retain rows with at least 2 non-missing values:")
print(df.dropna(thresh=2))

Remove Columns

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [np.nan, np.nan, np.nan, np.nan],  # all missing
    "C": [1, 2, 3, 4]
})
print("Original data:")
print(df)
print()

# Remove columns containing missing values
print("Remove columns with missing values:")
print(df.dropna(axis=1))
print()

# how='all': remove columns where all values are missing
print("Remove columns where all values are missing:")
print(df.dropna(axis=1, how="all"))

fillna: Fill Missing Values

Fill with Fixed Values

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [np.nan, 2, 3, np.nan, 5],
    "C": [1, 2, 3, 4, np.nan]
})
print("Original data:")
print(df)
print()

# Fill with 0
print("Fill with 0:")
print(df.fillna(0))
print()

# Fill different columns with different values
print("Fill different columns with different values:")
print(df.fillna({"A": 0, "B": 99, "C": -1}))
print()

# Forward fill (use previous value)
# ffill() replaces fillna(method="ffill"), which is deprecated since pandas 2.1
print("Forward fill:")
print(df.ffill())
print()

# Backward fill (use next value)
# bfill() replaces fillna(method="bfill") for the same reason
print("Backward fill:")
print(df.bfill())

Fill with Statistical Values

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5, 6],
    "B": [10, np.nan, 30, np.nan, 50, 60]
})
print("Original data:")
print(df)
print()

# Fill with mean (each column uses its own mean)
print("Fill with mean:")
print(df.fillna(df.mean()))
print()

# Fill with median (each column uses its own median)
print("Fill with median:")
print(df.fillna(df.median()))
print()

# Fill each column separately; the dict values must be scalars,
# so use the mean of column A rather than df.mean()
print("Fill column A with mean, column B with 0:")
print(df.fillna({"A": df["A"].mean(), "B": 0}))

Forward and backward filling are especially useful for time series data because they preserve continuity. Which one to choose depends on the business meaning of the data.
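For time series, it can help to cap how far a value is carried into a gap. The following is a minimal sketch, assuming hypothetical daily readings (the names `readings`, `forward`, and `backward` are illustrative); `limit` is an existing parameter of ffill() and bfill().

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a three-day gap
idx = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([10.0, np.nan, np.nan, np.nan, 50.0, 60.0], index=idx)

# Carry the last observation forward, but at most 1 step into the gap
forward = readings.ffill(limit=1)

# Pull the next observation backward, again at most 1 step
backward = readings.bfill(limit=1)

# Side-by-side view of the raw and filled values
print(pd.DataFrame({"raw": readings, "ffill": forward, "bfill": backward}))
```

Capping the fill keeps a stale value from being stretched across a long outage; the remaining gaps stay NaN and can be handled by another strategy.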

interpolate: Interpolation-based Filling

Interpolation estimates each missing value from its neighboring data points instead of substituting one fixed value, so the filled values follow the local trend of the data.

Linear Interpolation

Example

import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan, 4, 5, np.nan, 7])
print("Original Series:")
print(s)
print()

# Linear interpolation (default)
print("Linear interpolation:")
print(s.interpolate())
print()

# Specify interpolation method: "time" requires a DatetimeIndex
print("Time-weighted interpolation (for time series):")
s2 = pd.Series([1, np.nan, np.nan, 4], index=pd.date_range("2024-01-01", periods=4, freq="D"))
print(s2.interpolate(method="time"))
print()

# Nearest neighbor interpolation (this method requires SciPy to be installed)
print("Fill with nearest value:")
print(s.interpolate(method="nearest"))
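The difference between the linear and time methods only shows up when the index is unevenly spaced. Here is a small sketch with made-up timestamps (the names `linear` and `by_time` are illustrative): method="linear" ignores the index and treats the values as equally spaced, while method="time" weights by elapsed time.

```python
import numpy as np
import pandas as pd

# Irregular timestamps: a 1-day step, then a 3-day step
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# Ignores the index: the missing point is the midpoint of 0 and 4
linear = s.interpolate(method="linear")

# Uses the index: 1 of the 4 elapsed days has passed at the missing point
by_time = s.interpolate(method="time")

print(linear.iloc[1])
print(by_time.iloc[1])
```

With evenly spaced data, as in the example above, the two methods give the same result.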

Interpolation for DataFrame

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [10, np.nan, 30, 40, np.nan]
})
print("Original data:")
print(df)
print()

# Interpolate directly on the DataFrame (column by column)
df_interpolated = df.interpolate(method="linear")
print("After linear interpolation:")
print(df_interpolated)
print()

# Limit interpolation range: limit=N fills at most N consecutive NaNs per gap
print("Fill at most 1 consecutive missing value in each gap:")
print(df.interpolate(limit=1))
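The effect of `limit` is easier to see on a gap longer than one value. This is a minimal sketch with an illustrative Series (`gap`, `forward_one`, and `both_one` are made-up names); `limit` and `limit_direction` are existing interpolate() parameters.

```python
import numpy as np
import pandas as pd

# A single gap of three consecutive missing values
gap = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# limit=1: fill only the first NaN of the gap (default forward direction)
forward_one = gap.interpolate(limit=1)

# limit=1 in both directions: fill one NaN from each end of the gap
both_one = gap.interpolate(limit=1, limit_direction="both")

print(forward_one.tolist())
print(both_one.tolist())
```

The filled values still come from the full linear estimate across the gap; `limit` only controls how many of them are kept.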

Practical Example: Handling Real Data

Example

import pandas as pd
import numpy as np

# Simulate real business data
np.random.seed(42)
n = 20
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=n),
    "Sales Revenue": np.random.choice([100, 200, np.nan, 300, np.nan], n),
    "Customer Count": np.random.choice([10, np.nan, 20, 30], n),
    "Conversion Rate": np.random.choice([0.05, 0.1, np.nan, 0.15], n)
})
print("Missing value counts in original data:")
print(df.isna().sum())
print()

# Handling strategy (each step targets one column only):
# 1. Fill sales revenue by interpolation (business allows fluctuations);
#    limit_direction="both" also fills a gap at the very start
df["Sales Revenue"] = df["Sales Revenue"].interpolate(limit_direction="both")

# 2. Fill customer count with the column mean
df["Customer Count"] = df["Customer Count"].fillna(df["Customer Count"].mean())

# 3. Fill conversion rate with 0 (indicating no conversion)
df["Conversion Rate"] = df["Conversion Rate"].fillna(0)

print("Data after handling:")
print(df)
print()
print("Missing value counts after handling:")
print(df.isna().sum())

Frequently Asked Questions

1. Does fillna create a new object or modify in place?
By default, fillna returns a new object and leaves the original unchanged. Assign the result back, for example df = df.fillna(0), or pass inplace=True to modify in place.

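2. Why do missing values remain after interpolation?
By default, interpolate() fills in the forward direction only, so missing values at the beginning of a sequence are left untouched (with the default linear method, trailing gaps are filled with the last valid value). Pass limit_direction="both", or follow up with bfill(), to handle the leading gap.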

3. How do NaN and empty strings differ?
Empty strings "" are not treated as missing values; convert them first with replace("", np.nan).
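The three answers above can be checked with a short sketch. All data and variable names here are made up for illustration; the calls themselves (fillna, interpolate with limit_direction, replace) are standard Pandas methods.

```python
import numpy as np
import pandas as pd

# 1. fillna returns a new object; the original keeps its NaN
s = pd.Series([1.0, np.nan, 3.0])
filled = s.fillna(0)
print(s.isna().sum(), filled.isna().sum())

# 2. Default (forward) linear interpolation leaves a leading NaN;
#    limit_direction="both" fills it as well
edges = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
default = edges.interpolate()
both = edges.interpolate(limit_direction="both")
print(default.tolist())
print(both.tolist())

# 3. Empty strings are not counted as missing until converted to NaN
names = pd.Series(["Ann", "", "Bob"])
converted = names.replace("", np.nan)
print(names.isna().sum(), converted.isna().sum())
```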

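When choosing a filling method, consider the business context: forward filling suits time series in which the last observed value still applies, backward filling suits cases where a later observation can stand in for earlier gaps, mean or median filling suits numerical columns without a strong trend, and interpolation suits continuous, ordered data.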