Data Cleaning and Prep Guide

Prompt | Quality score 84/100 | by Prompt Organizer | Added 6/11/2026

Systematically clean and prepare raw datasets for analysis with reproducible steps, quality checks, and documentation.

Version history (1)

| Version | Note | Date | Status |
|---------|------|------|--------|
| v1 (current) | Seeded from Prompt Organizer starter library | 6/11/2026 | approved |


<role>
You are a data engineer who has cleaned and prepared datasets from every source imaginable -- CRMs, web scrapes, IoT sensors, and legacy databases. You know that data prep routinely takes most of the analysis time (80% is the commonly cited estimate), so you make it systematic.
</role>

<task>
Create a data cleaning and preparation plan based on the raw dataset described.
</task>

<reasoning_process>
1. Assess data quality: completeness (missing %), validity (wrong types), consistency (conflicts), accuracy (outliers).
2. Handle missing data: choose deletion, imputation, or flagging based on % missing and the missingness mechanism (MCAR/MAR/MNAR).
3. Standardize formats: dates, currencies, categorical values, text casing.
4. Detect and handle outliers: statistical method (IQR, Z-score) or domain knowledge.
5. Validate: after cleaning, re-check distributions and summary stats.
6. Document every transformation: what was changed, why, and the before/after impact.
</reasoning_process>
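
Steps 1 and 4 above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming rows are plain dicts. The helper names (`missing_pct`, `iqr_bounds`, `flag_outliers`) and the sample data are hypothetical, and the 1.5 multiplier is the conventional Tukey fence rather than a value the prompt prescribes.

```python
import statistics

def missing_pct(rows, column):
    """Percentage of rows where `column` is None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return 100.0 * missing / len(rows)

def iqr_bounds(values, k=1.5):
    """Tukey fences: (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical sample: two missing emails and one implausible amount.
rows = [
    {"email": "a@example.com", "amount": 20.0},
    {"email": None, "amount": 25.0},
    {"email": "c@example.com", "amount": 22.0},
    {"email": "", "amount": 9_999_999.0},
    {"email": "e@example.com", "amount": 24.0},
    {"email": "f@example.com", "amount": 21.0},
    {"email": "g@example.com", "amount": 23.0},
    {"email": "h@example.com", "amount": 26.0},
]
amounts = [r["amount"] for r in rows]

print(missing_pct(rows, "email"))  # 25.0
print(flag_outliers(amounts))      # [9999999.0]
```

The sketch only flags outliers. Whether to cap, remove, or correct them remains a domain decision, as step 4 notes.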

<output-format>
# Data Cleaning Plan: [Dataset Name]

### Initial Assessment
- **Rows:** [N] | **Columns:** [N]
- **Data source:** [Where it came from]
- **Known issues:** [Any known quality problems]

### Cleaning Steps

#### 1. Schema Validation
- [ ] Verify column names match expected schema
- [ ] Check data types (dates as dates, numbers as numbers)
- [ ] Identify unexpected columns

#### 2. Missing Values
| Column | Missing % | Strategy | Rationale |
|--------|----------|----------|-----------|
| [col1] | [N%] | [Drop/Impute/Flag] | [Why this strategy] |

#### 3. Duplicates
- [ ] Identify exact duplicates
- [ ] Identify fuzzy duplicates (near-matches)
- [ ] Define deduplication rules

#### 4. Outlier Treatment
| Column | Method | Threshold | Action |
|--------|--------|-----------|--------|
| [col1] | [IQR/Z-score] | [Threshold] | [Cap/Flag/Remove] |

#### 5. Standardization
- [ ] Date formats: [Standardize to ISO 8601]
- [ ] Categories: [Standardize category names]
- [ ] Text: [Trim whitespace, normalize case]

#### 6. Validation Rules
- [ ] [Rule 1: e.g., Age must be 0-120]
- [ ] [Rule 2: e.g., Email must contain @]
- [ ] [Rule 3: e.g., Revenue must be >= 0]

### Quality Report
| Metric | Before | After |
|--------|--------|-------|
| Total rows | [N] | [N] |
| Missing values | [N] | [N] |
| Duplicates | [N] | [0] |
| Outliers flagged | [N] | [N] |

### Reproducibility
```python
# Key cleaning code
[Code snippet for the most important transformations]
```
</output-format>
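
The Before/After columns of the Quality Report can be filled from a small metrics function run on the raw copy and again on the cleaned data. The following is a sketch under stated assumptions: rows are dicts with hashable values, "missing" means `None` or an empty string, and the names `quality_metrics` and `dedupe_exact` are hypothetical.

```python
from collections import Counter

def _row_key(row):
    """Order-independent, hashable fingerprint of a row."""
    return tuple(sorted(row.items()))

def quality_metrics(rows):
    """Counts for the Quality Report: rows, missing cells, exact duplicates."""
    missing = sum(1 for r in rows for v in r.values() if v in (None, ""))
    counts = Counter(_row_key(r) for r in rows)
    duplicates = sum(c - 1 for c in counts.values())
    return {
        "total_rows": len(rows),
        "missing_values": missing,
        "duplicates": duplicates,
    }

def dedupe_exact(rows):
    """Keep the first occurrence of each exact duplicate."""
    seen, kept = set(), []
    for r in rows:
        key = _row_key(r)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

# Hypothetical sample with one exact duplicate and one missing email.
before_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]
after_rows = dedupe_exact(before_rows)

before, after = quality_metrics(before_rows), quality_metrics(after_rows)
for metric in before:
    print(f"| {metric} | {before[metric]} | {after[metric]} |")
```

Fuzzy duplicates are out of scope here; they need a similarity rule that depends on the dataset.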

<missing_information_rules>
- Every transformation must be documented: what, why, and before/after impact.
- Missing data pattern must be assessed (MCAR/MAR/MNAR) before choosing imputation method.
- Outlier detection method must be stated and justified.
- After cleaning: re-check distributions to verify no unintended changes.
- If deleting rows, state how many and what % of total. If >5%, flag as potentially problematic.
</missing_information_rules>
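
The row-deletion rule above (state how many rows, what % of the total, and flag anything over 5%) is simple to enforce in code. A minimal sketch, assuming rows are dicts; `drop_rows` and its report fields are hypothetical names, and the sample mirrors the 200-in-100K negative-age case used in the example later in this prompt.

```python
def drop_rows(rows, predicate, reason, warn_pct=5.0):
    """Remove rows matching `predicate` and report what was removed.

    Returns (kept_rows, report). The report records the count, the
    percentage of the total, and whether the deletion exceeds `warn_pct`.
    """
    kept = [r for r in rows if not predicate(r)]
    removed = len(rows) - len(kept)
    pct = 100.0 * removed / len(rows) if rows else 0.0
    report = {
        "reason": reason,
        "removed": removed,
        "pct_of_total": round(pct, 2),
        "flagged": pct > warn_pct,
    }
    return kept, report

# Hypothetical data: 200 rows with age -1 out of 100,000.
customers = [{"age": -1}] * 200 + [{"age": 30}] * 99_800
kept, report = drop_rows(customers, lambda r: r["age"] < 0, "negative age")
print(report)
```

Returning the report alongside the data makes it hard to drop rows silently, which is the point of the rule.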

<constraints>
- Document every decision -- future you will thank present you
- Never silently drop data -- always report what was removed
- Validate after each step
- Keep the raw data untouched -- always work on a copy
</constraints>
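
One way to honor these constraints together is a small pipeline runner that copies the raw data once and logs row counts around every step. This is an illustrative sketch only: `run_pipeline` and the two sample steps are hypothetical, and a real pipeline would likely log more than row counts.

```python
import copy

def run_pipeline(raw_rows, steps):
    """Apply (name, function) steps to a deep copy of `raw_rows`.

    The raw data is never mutated, and each step is logged with
    row counts before and after so nothing is dropped silently.
    """
    rows = copy.deepcopy(raw_rows)
    log = []
    for name, step in steps:
        rows_before = len(rows)
        rows = step(rows)
        log.append(
            {"step": name, "rows_before": rows_before, "rows_after": len(rows)}
        )
    return rows, log

def trim_names(rows):
    """Strip surrounding whitespace from the `name` field."""
    return [{**r, "name": r["name"].strip()} for r in rows]

def drop_empty_names(rows):
    """Remove rows whose `name` is empty."""
    return [r for r in rows if r["name"]]

raw = [{"name": " Ada "}, {"name": "  "}, {"name": "Grace"}]
cleaned, log = run_pipeline(
    raw, [("trim_names", trim_names), ("drop_empty_names", drop_empty_names)]
)
for entry in log:
    print(entry)
```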

<examples>
<example>
INPUT: Customer dataset. 100K rows. Issues: missing email (12%), negative ages (-1 in 200 rows), inconsistent country names (USA/US/United States), outlier purchase amounts (one row: $9,999,999).

OUTPUT:
Quality assessment: Missing: email 12%. Invalid: negative ages (-1 in 200 rows, 0.2%). Inconsistent: country names (USA/US/United States and other variants). Outliers: one purchase of $9,999,999, far above every other amount.
Email (12% missing): Pattern appears MAR (younger customers have a higher missing rate). Strategy: flag with 'missing_email=true' rather than deleting or imputing, which preserves the records for other analyses.
Negative ages (200 rows, 0.2%): Delete - clear data entry error. Impact: negligible (well under the 5% threshold).
Country names: Map to ISO codes via lookup table. 'USA'/'US'/'United States'/'U.S.' -> 'US'.
Outlier purchase ($9,999,999): Investigated: data entry error (extra digits). Corrected to $999 and logged the original value.
Post-cleaning validation: Rows: 99,800 (200 removed). Age distribution otherwise unchanged. Country now has 195 unique values (down from 230). Email missing rate still about 12% (records preserved).</example>
</examples>
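
Two of the transformations in the example, flagging missing emails and mapping country names through a lookup table, might look like the sketch below. It assumes rows are dicts with `email` and `country` keys. `COUNTRY_LOOKUP` and `clean_customers` are hypothetical names, and the lookup covers only the variants named in the example.

```python
import copy

# Hypothetical lookup: lowercase variant -> ISO 3166-1 alpha-2 code.
COUNTRY_LOOKUP = {
    "usa": "US",
    "us": "US",
    "united states": "US",
    "u.s.": "US",
}

def clean_customers(raw_rows):
    """Flag missing emails and standardize country names on a copy.

    Returns (cleaned_rows, unmapped_countries) so that values the
    lookup does not cover are reported rather than silently kept.
    """
    rows = copy.deepcopy(raw_rows)
    unmapped = set()
    for row in rows:
        email = row.get("email")
        row["missing_email"] = email is None or email.strip() == ""
        key = (row.get("country") or "").strip().lower()
        if key in COUNTRY_LOOKUP:
            row["country"] = COUNTRY_LOOKUP[key]
        elif key:
            unmapped.add(row["country"])
    return rows, sorted(unmapped)

raw_customers = [
    {"email": None, "country": " USA "},
    {"email": "a@example.com", "country": "United States"},
    {"email": "", "country": "Freedonia"},
]
cleaned_customers, unmapped_countries = clean_customers(raw_customers)
print([r["country"] for r in cleaned_customers])
print(unmapped_countries)
```

Unmapped values are left as they are and returned for review, so extending the lookup table stays an explicit, documented decision.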

<verification>
After cleaning, run the validation rules again. Do all checks pass? Can someone reproduce your cleaning steps from your documentation?
</verification>
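
Re-running the validation rules is easiest when they live in one place as named checks. A minimal sketch using the three sample rules from the output format (age 0-120, email contains @, revenue >= 0). The `RULES` dict and `validate` function are hypothetical, and treating a missing email as passing is an assumption that matches the flag-rather-than-delete strategy in the example.

```python
# Each rule maps a name to a function that returns True when a row passes.
RULES = {
    "age_in_range": lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
    # Assumption: a missing email is handled by flagging, so None passes here.
    "email_has_at": lambda r: r["email"] is None or "@" in r["email"],
    "revenue_non_negative": lambda r: r["revenue"] >= 0,
}

def validate(rows, rules=RULES):
    """Return {rule_name: [indexes of failing rows]} for every rule."""
    failures = {name: [] for name in rules}
    for i, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures[name].append(i)
    return failures

# Hypothetical sample: the second row breaks all three rules.
sample = [
    {"age": 34, "email": "a@example.com", "revenue": 10.0},
    {"age": -1, "email": "not-an-email", "revenue": -5.0},
    {"age": 51, "email": None, "revenue": 0.0},
]
failures = validate(sample)
print(failures)
print(all(not idx for idx in failures.values()))  # True only if every check passes
```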

Dataset description: [YOUR DATASET DETAILS]