Dirty CRM data is a silent revenue killer. Duplicate contacts get routed to different reps. Stale opportunities inflate pipeline forecasts. Inconsistent company names break territory assignments. Most teams know the data is bad, but cleaning it manually feels impossible. A Python script running on a nightly schedule solves this without anyone lifting a finger.

What the Script Cleans

The cleanup script handles three categories of data quality issues:

Issue                        Example                                Fix
Inconsistent company names   “Acme Inc”, “Acme, Inc.”, “ACME”       Standardize to canonical form
Duplicate contacts           Same email across multiple records     Flag and merge candidates
Stale records                Opportunities unchanged for 90+ days   Flag for review or archive

Standardizing Company Names

Inconsistent company names are among the most common data quality issues in any CRM. The fuzzywuzzy library handles the approximate string matching:

from fuzzywuzzy import fuzz, process
import pandas as pd

def standardize_companies(accounts):
    """Map each account name to a canonical spelling and collect merge candidates."""
    canonical_names = {}
    # Sort so the alphabetically-first spelling in each cluster becomes canonical
    sorted_accounts = sorted(accounts, key=lambda x: x["Name"])

    for account in sorted_accounts:
        name = account["Name"].strip()
        # Match against the canonical names seen so far; on the first iteration
        # there are none, so fall back to comparing the name with itself
        match, score = process.extractOne(
            name,
            list(canonical_names.keys()) or [name],
            scorer=fuzz.token_sort_ratio,
        )
        if score >= 90 and match != name:
            # Close enough to an existing canonical name: queue for merge
            canonical_names.setdefault(match, []).append(account["Id"])
            print(f"  MATCH: '{name}' -> '{match}' (score: {score})")
        else:
            # No close match: this spelling becomes a new canonical name
            canonical_names[name] = []

    return canonical_names
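
A quick sanity check with a couple of hypothetical account records shows the return shape (the Ids here are made up):

accounts = [
    {"Id": "001A000001", "Name": "Acme Inc"},
    {"Id": "001A000002", "Name": "Acme, Inc."},
    {"Id": "001A000003", "Name": "Globex Corporation"},
]
print(standardize_companies(accounts))
#   MATCH: 'Acme, Inc.' -> 'Acme Inc' (score: 100)
# {'Acme Inc': ['001A000002'], 'Globex Corporation': []}

Keys are the canonical spellings; each value lists the record Ids that should be folded into that account.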

Install the dependencies:

pip install fuzzywuzzy python-Levenshtein simple-salesforce
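
It helps to see what the scorer actually returns for the variants in the table above. A quick check, assuming fuzzywuzzy's default preprocessing (lowercasing and punctuation stripping):

from fuzzywuzzy import fuzz

# Case and punctuation are stripped before comparison, so these are identical
print(fuzz.token_sort_ratio("Acme Inc", "Acme, Inc."))  # 100
# A dropped token is penalized heavily; this lands below the 90 cutoff
print(fuzz.token_sort_ratio("Acme Inc", "ACME"))  # 67

Heavily truncated variants like “ACME” will not clear a 90 threshold on their own, so keep a manual alias list for the worst offenders rather than lowering the threshold and risking false positives.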

Deduplicating Contacts

Duplicate contacts create routing nightmares. This function groups contacts by email and identifies merge candidates:

def find_duplicate_contacts(sf):
    """Group contacts sharing an email address and pick a survivor per group."""
    query = """
        SELECT Id, FirstName, LastName, Email, AccountId, CreatedDate
        FROM Contact
        WHERE Email != null
        ORDER BY Email, CreatedDate ASC
    """
    results = sf.query_all(query)
    contacts = pd.DataFrame(results["records"]).drop(columns=["attributes"])
    # Normalize case so 'John@acme.com' and 'john@acme.com' count as duplicates
    contacts["Email"] = contacts["Email"].str.lower()

    duplicates = contacts[contacts.duplicated(subset=["Email"], keep=False)]
    groups = []
    for email, group in duplicates.groupby("Email"):
        group = group.sort_values("CreatedDate")  # oldest record wins
        groups.append({
            "email": email,
            "count": len(group),
            "keep_id": group.iloc[0]["Id"],
            "merge_ids": group.iloc[1:]["Id"].tolist(),
        })
    return groups
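
simple-salesforce does not wrap Salesforce's merge operation (contact merges go through the SOAP API or Salesforce's own duplicate management tools), so a practical next step is a printed review report rather than an automatic merge. A minimal sketch, with preview_duplicates as a hypothetical helper name:

def preview_duplicates(sf):
    """Print each duplicate group so a human can approve the merges."""
    for group in find_duplicate_contacts(sf):
        print(f"{group['email']}: {group['count']} records")
        print(f"  keep:  {group['keep_id']}")
        for merge_id in group["merge_ids"]:
            print(f"  merge: {merge_id}")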

Flagging Stale Records

Opportunities that have not been touched in 90 days are likely dead. Flag them for review rather than deleting them outright:

def flag_stale_opportunities(sf, days=90):
    query = f"""
        SELECT Id, Name, StageName, Description, LastModifiedDate, Owner.Name
        FROM Opportunity
        WHERE IsClosed = false
        AND LastModifiedDate < LAST_N_DAYS:{days}
    """
    results = sf.query_all(query)
    stale = results["records"]

    flagged = 0
    for opp in stale:
        existing = opp.get("Description") or ""
        if existing.startswith("[AUTO-FLAGGED]"):
            continue  # already flagged on a previous nightly run
        # Prepend the flag so any existing notes are preserved
        sf.Opportunity.update(opp["Id"], {
            "Description": f"[AUTO-FLAGGED] No activity in {days}+ days. "
                           f"Previous stage: {opp['StageName']}\n" + existing,
        })
        flagged += 1

    return flagged
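
The takeaways below recommend a dry-run pass before letting any script write to production. One way to get that here is a read-only counterpart that prints what would be flagged; preview_stale_opportunities is a hypothetical name, not part of the original script:

def preview_stale_opportunities(sf, days=90):
    """Dry run: list the opportunities flag_stale_opportunities would touch."""
    query = f"""
        SELECT Id, Name, StageName, Owner.Name
        FROM Opportunity
        WHERE IsClosed = false
        AND LastModifiedDate < LAST_N_DAYS:{days}
    """
    for opp in sf.query_all(query)["records"]:
        owner = opp["Owner"]["Name"] if opp.get("Owner") else "unassigned"
        print(f"WOULD FLAG: {opp['Name']} ({opp['StageName']}, owner: {owner})")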

Running It All Together

The main script chains each cleanup step and logs results:

import logging
import os

from simple_salesforce import Salesforce

logging.basicConfig(
    level=logging.INFO,
    filename="cleanup.log",
    format="%(asctime)s %(levelname)s %(message)s",
)

def main():
    sf = Salesforce(
        username=os.environ["SF_USERNAME"],
        password=os.environ["SF_PASSWORD"],
        security_token=os.environ["SF_TOKEN"],
    )

    logging.info("Starting CRM cleanup run")

    accounts = sf.query_all("SELECT Id, Name FROM Account")["records"]
    canonical = standardize_companies(accounts)
    logging.info(f"Found {len(canonical)} canonical company names")

    dupes = find_duplicate_contacts(sf)
    logging.info(f"Found {len(dupes)} duplicate contact groups")

    stale_count = flag_stale_opportunities(sf, days=90)
    logging.info(f"Flagged {stale_count} stale opportunities")

if __name__ == "__main__":
    main()

Schedule it nightly with cron:

0 2 * * * cd /opt/revops-scripts && python crm_cleanup.py

Key Takeaways

  • Automated CRM cleanup running on a nightly schedule prevents data quality from degrading between manual audits
  • Fuzzy matching with a 90% similarity threshold catches most company name variants without creating false positives
  • Always run cleanup scripts in dry-run or preview mode before enabling writes to your production CRM
  • Flagging stale records rather than deleting them gives reps a chance to review before data is lost