Data Errors
identify_errors
pywrangle.data_errors.identify_errors.identify_errors(df: dataframe, column: str, threshold: int = 65, show_progress: bool = False, limit: int = 5) → None
Prints potential data errors in the specified DataFrame column.
Parameters
df (dataframe) – DataFrame.
column (str) – Column in DataFrame to check.
threshold (int) – Minimum Similarity Index, on a scale of 0 to 100, for two strings to be reported as a potential data error. A higher threshold applies stricter matching. Defaults to 65.
show_progress (bool) – Prints matching progress to console. Defaults to False.
limit (int) – Maximum number of matches reported for each string. Higher values increase computation time and return more false positives. Defaults to 5.
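How threshold and limit interact can be sketched with the standard library alone. The snippet below is only an analogy, not pywrangle's implementation: it uses difflib.get_close_matches, whose cutoff (0 to 1) plays the role of threshold and whose n plays the role of limit. The helper name candidate_matches and the sample list are hypothetical.

```python
import difflib


def candidate_matches(value, pool, threshold=65, limit=5):
    """Hypothetical helper: return up to `limit` strings from `pool`
    that resemble `value` at or above `threshold` (0-100).

    Uses difflib's ratio, NOT pywrangle's Similarity Index.
    """
    cutoff = threshold / 100  # difflib expects a cutoff between 0 and 1
    others = [s for s in pool if s != value]  # skip exact duplicates of value
    return difflib.get_close_matches(value, others, n=limit, cutoff=cutoff)


# Made-up sample data for illustration only.
states = ["california", "californi a", "californias", "cali fornia",
          "texas", "nevada"]

loose = candidate_matches("california", states, threshold=65, limit=5)
strict = candidate_matches("california", states, threshold=99, limit=5)
capped = candidate_matches("california", states, threshold=65, limit=2)

print(loose)   # the near-duplicates of "california"
print(strict)  # a very high threshold leaves nothing to report
print(capped)  # limit caps how many candidates come back
```

Raising the threshold shrinks the candidate list, while lowering limit truncates it regardless of how many strings clear the threshold.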
Notes
Data entry errors are identified based on a Similarity Index.
The Similarity Index is calculated using algorithms derived from Levenshtein distance and Double Metaphone.
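To give a feel for the Levenshtein half of that calculation, here is a minimal sketch that turns edit distance into a 0 to 100 score. The normalization (dividing by the longer string's length) is an assumption made for illustration; pywrangle's actual Similarity Index also involves Double Metaphone and will produce different numbers. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100 means identical, 0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("california", "californias"))
print(round(similarity_percent("california", "californias"), 2))
```

A phonetic encoding such as Double Metaphone complements this by catching strings that sound alike but differ by several characters, which pure edit distance would score poorly.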
Example
>>> df = create_df.create_str_df2()
>>> ## Identify potential errors in the state column
>>> pw.identify_errors(df= df, column= 'states', threshold= 70)
Record | String       | Match        | Similarity Index
------ | ------------ | ------------ | ----------------
1      | california   | californi as | 92.75
2      | california   | californi a  | 97.0
3      | california   | californias  | 94.25
4      | california   | cali fornia  | 96.0
converge_sim_vals
pywrangle.data_errors.converge_sim_vals.converge_sim_vals(df: DataFrame, column: str, values: Union[tuple, list], index: int) → DataFrame
Returns DataFrame with similar values ‘converged’ to the value at index.
Parameters
df (DataFrame) – DataFrame to change.
column (str) – Column name.
values (Union[tuple, list]) – Values to change.
index (int) – Position in values of the target value; every other entry in values is replaced with it.
Notes
This function can be called after identifying errors with the identify_errors function.
Example
>>> df = create_df.create_str_df4()
>>> print(df)
   Index       States
0      1   california
1      2   california
2      3  cali fornia
3      4  californias
4      5  californi a
Index(['Index', 'States'], dtype='object')
>>> values = ['california', 'cali fornia', 'californias', 'californi a']
>>> df = pw.converge_sim_vals(df= df, column= 'States', values= values, index= 0)
>>> print(df)
   Index      States
0      1  california
1      2  california
2      3  california
3      4  california
4      5  california
Index(['Index', 'States'], dtype='object')