Dirty CRM data is a silent revenue killer. Duplicate contacts get routed to different reps. Stale opportunities inflate pipeline forecasts. Inconsistent company names break territory assignments. Most teams know the data is bad, but cleaning it manually feels impossible. A Python script running on a nightly schedule solves this without anyone lifting a finger.

What the Script Cleans

The cleanup script handles three categories of data quality issues:

Issue                        Example                                Fix
Inconsistent company names   “Acme Inc”, “Acme, Inc.”, “ACME”       Standardize to canonical form
Duplicate contacts           Same email across multiple records     Flag and merge candidates
Stale records                Opportunities unchanged for 90+ days   Flag for review or archive

Standardizing Company Names

Inconsistent company names are among the most common data quality issues in any CRM. The fuzzywuzzy library handles the approximate string matching:

from fuzzywuzzy import fuzz, process
import pandas as pd

def standardize_companies(accounts):
    """Map each account name to a canonical spelling and collect merge candidates."""
    canonical_names = {}
    # Sort so the alphabetically-first spelling in each cluster becomes canonical
    sorted_accounts = sorted(accounts, key=lambda x: x["Name"])

    for account in sorted_accounts:
        name = account["Name"].strip()
        # Match against the canonical names seen so far; on the first iteration
        # there are none, so fall back to comparing the name with itself
        match, score = process.extractOne(
            name,
            list(canonical_names.keys()) or [name],
            scorer=fuzz.token_sort_ratio,
        )
        if score >= 90 and match != name:
            # Close enough to an existing canonical name: queue for merge
            canonical_names.setdefault(match, []).append(account["Id"])
            print(f"  MATCH: '{name}' -> '{match}' (score: {score})")
        else:
            # No close match: this spelling becomes a new canonical name
            canonical_names[name] = []

    return canonical_names
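
A quick sanity check with a couple of hypothetical account records shows the return shape (the Ids here are made up):

accounts = [
    {"Id": "001A000001", "Name": "Acme Inc"},
    {"Id": "001A000002", "Name": "Acme, Inc."},
    {"Id": "001A000003", "Name": "Globex Corporation"},
]
print(standardize_companies(accounts))
#   MATCH: 'Acme, Inc.' -> 'Acme Inc' (score: 100)
# {'Acme Inc': ['001A000002'], 'Globex Corporation': []}

Keys are the canonical spellings; each value lists the record Ids that should be folded into that account.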

Install the dependencies:

pip install fuzzywuzzy python-Levenshtein simple-salesforce
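
It helps to see what the scorer actually returns for the variants in the table above. A quick check, assuming fuzzywuzzy's default preprocessing (lowercasing and punctuation stripping):

from fuzzywuzzy import fuzz

# Case and punctuation are stripped before comparison, so these are identical
print(fuzz.token_sort_ratio("Acme Inc", "Acme, Inc."))  # 100
# A dropped token is penalized heavily; this lands below the 90 cutoff
print(fuzz.token_sort_ratio("Acme Inc", "ACME"))  # 67

Heavily truncated variants like “ACME” will not clear a 90 threshold on their own, so keep a manual alias list for the worst offenders rather than lowering the threshold and risking false positives.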

Deduplicating Contacts

Duplicate contacts create routing nightmares. This function groups contacts by email and identifies merge candidates:

def find_duplicate_contacts(sf):
    """Group contacts sharing an email address and pick a survivor per group."""
    query = """
        SELECT Id, FirstName, LastName, Email, AccountId, CreatedDate
        FROM Contact
        WHERE Email != null
        ORDER BY Email, CreatedDate ASC
    """
    results = sf.query_all(query)
    contacts = pd.DataFrame(results["records"]).drop(columns=["attributes"])
    # Normalize case so 'John@acme.com' and 'john@acme.com' count as duplicates
    contacts["Email"] = contacts["Email"].str.lower()

    duplicates = contacts[contacts.duplicated(subset=["Email"], keep=False)]
    groups = []
    for email, group in duplicates.groupby("Email"):
        group = group.sort_values("CreatedDate")  # oldest record wins
        groups.append({
            "email": email,
            "count": len(group),
            "keep_id": group.iloc[0]["Id"],
            "merge_ids": group.iloc[1:]["Id"].tolist(),
        })
    return groups
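
simple-salesforce does not wrap Salesforce's merge operation (contact merges go through the SOAP API or Salesforce's own duplicate management tools), so a practical next step is a printed review report rather than an automatic merge. A minimal sketch, with preview_duplicates as a hypothetical helper name:

def preview_duplicates(sf):
    """Print each duplicate group so a human can approve the merges."""
    for group in find_duplicate_contacts(sf):
        print(f"{group['email']}: {group['count']} records")
        print(f"  keep:  {group['keep_id']}")
        for merge_id in group["merge_ids"]:
            print(f"  merge: {merge_id}")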

Flagging Stale Records

Opportunities that have not been touched in 90 days are likely dead. Flag them for review rather than deleting them outright:

def flag_stale_opportunities(sf, days=90):
    query = f"""
        SELECT Id, Name, StageName, Description, LastModifiedDate, Owner.Name
        FROM Opportunity
        WHERE IsClosed = false
        AND LastModifiedDate < LAST_N_DAYS:{days}
    """
    results = sf.query_all(query)
    stale = results["records"]

    flagged = 0
    for opp in stale:
        existing = opp.get("Description") or ""
        if existing.startswith("[AUTO-FLAGGED]"):
            continue  # already flagged on a previous nightly run
        # Prepend the flag so any existing notes are preserved
        sf.Opportunity.update(opp["Id"], {
            "Description": f"[AUTO-FLAGGED] No activity in {days}+ days. "
                           f"Previous stage: {opp['StageName']}\n" + existing,
        })
        flagged += 1

    return flagged
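
The takeaways below recommend a dry-run pass before letting any script write to production. One way to get that here is a read-only counterpart that prints what would be flagged; preview_stale_opportunities is a hypothetical name, not part of the original script:

def preview_stale_opportunities(sf, days=90):
    """Dry run: list the opportunities flag_stale_opportunities would touch."""
    query = f"""
        SELECT Id, Name, StageName, Owner.Name
        FROM Opportunity
        WHERE IsClosed = false
        AND LastModifiedDate < LAST_N_DAYS:{days}
    """
    for opp in sf.query_all(query)["records"]:
        owner = opp["Owner"]["Name"] if opp.get("Owner") else "unassigned"
        print(f"WOULD FLAG: {opp['Name']} ({opp['StageName']}, owner: {owner})")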

Running It All Together

The main script chains each cleanup step and logs results:

import logging
import os

from simple_salesforce import Salesforce

logging.basicConfig(
    level=logging.INFO,
    filename="cleanup.log",
    format="%(asctime)s %(levelname)s %(message)s",
)

def main():
    sf = Salesforce(
        username=os.environ["SF_USERNAME"],
        password=os.environ["SF_PASSWORD"],
        security_token=os.environ["SF_TOKEN"],
    )

    logging.info("Starting CRM cleanup run")

    accounts = sf.query_all("SELECT Id, Name FROM Account")["records"]
    canonical = standardize_companies(accounts)
    logging.info(f"Found {len(canonical)} canonical company names")

    dupes = find_duplicate_contacts(sf)
    logging.info(f"Found {len(dupes)} duplicate contact groups")

    stale_count = flag_stale_opportunities(sf, days=90)
    logging.info(f"Flagged {stale_count} stale opportunities")

if __name__ == "__main__":
    main()

Schedule it nightly with cron:

0 2 * * * cd /opt/revops-scripts && python crm_cleanup.py

Key Takeaways

  • Automated CRM cleanup running on a nightly schedule prevents data quality from degrading between manual audits
  • Fuzzy matching with a 90% similarity threshold catches most company name variants without creating false positives
  • Always run cleanup scripts in dry-run or preview mode before enabling writes to your production CRM
  • Flagging stale records rather than deleting them gives reps a chance to review before data is lost