How We Made Our Legacy Crons 25x Faster

Legacy code is slow for a reason. Usually several. The hard part isn't fixing the slowness — it's finding which of the many possible causes is actually the bottleneck.

This is the story of how we made Simplilearn's legacy PHP cron jobs 25x faster, and why the debugging approach mattered as much as the fix.

The Problem

Simplilearn's platform runs on a mix of legacy PHP systems and newer Next.js/NestJS applications. The legacy side includes a set of cron jobs — scheduled PHP scripts handling things like enrollment processing, notification dispatch, reporting aggregation, and data sync.

These crons were slow. The slowness had not been ignored; it had accumulated as the platform grew. Jobs originally designed for a smaller data volume were now running against a platform serving 1M+ users. Execution times had drifted upward with the data, to the point where some jobs either overlapped with their next scheduled run or degraded the reliability of the processes they supported.

The mandate was clear: find out why, and fix it.

Why Standard Profiling Wasn't the Right Tool

The instinct when something is slow is to reach for a profiler. Attach Xdebug, generate a trace, open it in a flame graph viewer, look for the hot path.

The problem with that approach on legacy cron jobs is that profilers answer "where is time being spent inside the code?" They don't answer "where in the job's logical execution sequence does time get wasted?" Those are different questions.

A profiler might tell you that 40% of execution time is in database calls — but it won't tell you that 90% of those database calls are happening in a step that could be eliminated entirely, or that a data transformation step is doing work that was already done in a previous step and never cached.

For logic-heavy batch jobs with multiple sequential stages, profiling at the function level can point you at symptoms rather than causes.

The Step-Based Debugging Approach

The approach we used was simpler and more direct: add timing instrumentation at the logical step level, run the job, read the times.

Every cron job, even a complex one, is a sequence of stages. Fetch data. Transform it. Validate it. Write it somewhere. Send notifications. Clean up. These stages can be coarse or fine — the right granularity is whatever matches the job's logical structure.

The instrumentation looked like this:

$timer = new StepTimer();

$timer->start('fetch_enrollments');
$enrollments = fetchPendingEnrollments();
$timer->stop('fetch_enrollments');

$timer->start('enrich_with_course_data');
$enriched = enrichWithCourseData($enrollments);
$timer->stop('enrich_with_course_data');

$timer->start('process_notifications');
processNotifications($enriched);
$timer->stop('process_notifications');

$timer->report(); // logs each step with duration and % of total

This isn't sophisticated. It's a stopwatch on each step and a log at the end. The power is in what it tells you: which specific stage in the logical flow is eating time.
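The snippet above uses a StepTimer without showing it. A minimal sketch of what such a class could look like follows. Only the start, stop, and report calls come from the article; the class body, the summary() helper, and the log format are assumptions, and hrtime() requires PHP 7.3 or later.

```php
<?php
/**
 * Hypothetical StepTimer: one stopwatch per named step.
 * The implementation is an assumption, not the original class.
 */
final class StepTimer
{
    /** @var array<string, int|float> step name => start time in nanoseconds */
    private array $startedAt = [];

    /** @var array<string, float> step name => accumulated seconds */
    private array $durations = [];

    public function start(string $step): void
    {
        // hrtime(true) reads a monotonic clock, so wall-clock adjustments do not skew results.
        $this->startedAt[$step] = hrtime(true);
    }

    public function stop(string $step): void
    {
        if (!isset($this->startedAt[$step])) {
            throw new LogicException("Step '$step' was stopped before it was started");
        }
        $elapsed = (hrtime(true) - $this->startedAt[$step]) / 1e9;
        // Accumulate, so a step timed repeatedly inside a loop is summed rather than overwritten.
        $this->durations[$step] = ($this->durations[$step] ?? 0.0) + $elapsed;
        unset($this->startedAt[$step]);
    }

    /** @return array<string, array{seconds: float, percent: float}> */
    public function summary(): array
    {
        $total = array_sum($this->durations);
        $summary = [];
        foreach ($this->durations as $step => $seconds) {
            $summary[$step] = [
                'seconds' => $seconds,
                'percent' => $total > 0 ? 100 * $seconds / $total : 0.0,
            ];
        }
        return $summary;
    }

    /** Logs each step with its duration and share of the total timed duration. */
    public function report(): void
    {
        foreach ($this->summary() as $step => $row) {
            error_log(sprintf('%-28s %8.3fs %5.1f%%', $step, $row['seconds'], $row['percent']));
        }
    }
}
```

Percentages here are relative to the sum of the timed steps, so any work that happens outside a start/stop pair is not counted.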

What We Found

The step-level timing output was immediately illuminating. In almost every cron we instrumented, the distribution of time was not what you'd expect from reading the code.

Common patterns we found:

Data fetched multiple times. A cron would fetch a large dataset at the beginning, then fetch subsets of the same data again in later steps — either because the original fetch didn't include what was needed downstream, or because the code had been extended over time without revisiting the data fetch strategy. The fix was to expand the initial fetch and pass the data through, eliminating redundant queries.

N+1 patterns in batch loops. Processing a list of enrollments one by one, with each iteration making its own database call to get related data. Classic N+1. Invisible until you see the step labeled process_enrollment_loop consuming 70% of total job time on a job that should be I/O-light.
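The N+1 shape and its batched replacement can be sketched as follows. This is an illustration under assumptions: CourseRepository is a hypothetical in-memory stand-in that counts round trips instead of running SQL, and the function names and row shapes are invented for the example. It assumes PHP 7.4 or later.

```php
<?php
/**
 * Hypothetical stand-in for a courses table that counts round trips.
 * In a real cron, each method call would be one SQL query.
 */
final class CourseRepository
{
    public int $queryCount = 0;

    /** @var array<int, array> rows keyed by course id */
    private array $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    /** One round trip per course: the N+1 shape when called inside a loop. */
    public function findById(int $id): ?array
    {
        $this->queryCount++;
        return $this->rows[$id] ?? null;
    }

    /** One round trip for many courses, comparable to WHERE id IN (...). */
    public function findByIds(array $ids): array
    {
        $this->queryCount++;
        return array_intersect_key($this->rows, array_flip($ids));
    }
}

/** N+1: one lookup per enrollment. */
function enrichOneByOne(array $enrollments, CourseRepository $courses): array
{
    foreach ($enrollments as &$enrollment) {
        $enrollment['course'] = $courses->findById($enrollment['course_id']);
    }
    unset($enrollment); // break the reference left behind by foreach
    return $enrollments;
}

/** Batched: collect the distinct ids, fetch once, then join in memory. */
function enrichBatched(array $enrollments, CourseRepository $courses): array
{
    $ids = array_unique(array_column($enrollments, 'course_id'));
    $byId = $courses->findByIds($ids);
    foreach ($enrollments as &$enrollment) {
        $enrollment['course'] = $byId[$enrollment['course_id']] ?? null;
    }
    unset($enrollment);
    return $enrollments;
}
```

Both functions return the same enriched rows. The difference is that the first makes one round trip per enrollment while the second makes one in total, which is the kind of gap a step-level timer surfaces.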

Blocking operations in serial that could be batched. Notification dispatch sending one message at a time through an external API when the API supported batch requests. The step timer showed this immediately — a step taking 30 seconds that should take 2.
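A sketch of the batched dispatch, assuming the external API accepts a list of messages per request. The $sendBatch callable is a hypothetical stand-in for that API call, and the default batch size of 100 is an assumed limit, not a documented one.

```php
<?php
/**
 * Sends notifications in chunks instead of one request per message.
 *
 * @param array    $messages  messages to send
 * @param callable $sendBatch hypothetical wrapper around the API's batch endpoint;
 *                            receives one array of messages per call
 * @param int      $batchSize assumed per-request limit
 * @return int number of API requests made
 */
function dispatchInBatches(array $messages, callable $sendBatch, int $batchSize = 100): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }

    $requests = 0;
    // array_chunk splits the list into consecutive slices of at most $batchSize items.
    foreach (array_chunk($messages, $batchSize) as $batch) {
        $sendBatch($batch);
        $requests++;
    }
    return $requests;
}
```

With 250 messages and a batch size of 100, this makes 3 requests instead of 250. Error handling for a partially failed batch is left out and would depend on what the API reports back.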

Steps running on data that had already been filtered out. A transformation running on a full dataset, followed by a filter step, when the filter could have been applied at fetch time. The transformation step was doing work on rows that would be discarded.
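The ordering problem can be sketched in memory. The row shape, the 'pending' status value, and the function names below are assumptions made for the example, and the in-memory filter stands in for what would normally be a WHERE clause on the fetch query. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical row shape: ['id' => int, 'status' => string, 'email' => string].

/** Stand-in for an expensive per-row transformation; counts its own invocations. */
function transformRow(array $row, int &$calls): array
{
    $calls++;
    $row['email'] = strtolower(trim($row['email']));
    return $row;
}

/** Original shape: transform every row, then discard the ones that fail the filter. */
function transformThenFilter(array $rows, int &$calls): array
{
    $out = [];
    foreach ($rows as $row) {
        $out[] = transformRow($row, $calls);
    }
    // array_values reindexes after array_filter, which preserves the original keys.
    return array_values(array_filter($out, fn(array $r): bool => $r['status'] === 'pending'));
}

/** Reordered: filter first, so the transformation only touches rows that survive. */
function filterThenTransform(array $rows, int &$calls): array
{
    $out = [];
    foreach ($rows as $row) {
        if ($row['status'] !== 'pending') {
            continue;
        }
        $out[] = transformRow($row, $calls);
    }
    return $out;
}
```

The two functions produce the same output whenever the filter does not depend on the transformed fields. That condition is worth checking before reordering steps in a real job.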

None of these were surprising problems. They were all problems that had reasonable explanations — code added incrementally, schema changes that added tables, integrations extended without refactoring the data flow. The step timer just made the cost of each problem visible in a way that reading the code couldn't.

The Result

After systematically working through the cron jobs using this approach — instrument, identify the expensive step, fix the specific cause, verify — we achieved 25x faster execution across the jobs we optimized.

The improvements compounded. A job with multiple problems didn't reach its full speedup by fixing only the worst one; it got there by fixing each one in sequence. Rerunning the step timer after each fix showed the new bottleneck, which might have been invisible before because a worse problem was dominating total time.

What This Isn't

This isn't a story about a single clever trick. No single change produced a 25x improvement. The gain came from systematic identification and elimination of multiple sources of waste, using an approach that kept the feedback loop tight: instrument, run, see where the time goes, fix, repeat.

The step-based approach works on any batch job — PHP crons, Python ETL scripts, Node.js schedulers. The language doesn't matter. The principle is the same: add timing at the logical step level, not the function level, so you can see the job's behavior as a sequence of stages rather than a call stack.

Looking Forward

The cron optimization work surfaced something broader: a class of workloads being handled by scheduled polling that would be better served by an event-driven model. Some of the jobs we optimized were fundamentally doing "check if anything needs processing, process it, wait for the next run." The work is inherently reactive — it's responding to events that have already happened.

The right design for that class of workload is event-driven: process enrollments when they're created, dispatch notifications when the triggering event fires, rather than polling on a schedule and handling the backlog. We're currently in the design phase for a no-cron event-driven architecture for new systems. The cron optimization experience informed the case for it — when you see exactly where the time goes in a scheduled job, the inefficiency of polling becomes concrete rather than theoretical.
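To make the contrast with polling concrete, here is a purely illustrative sketch of the event-driven shape. It is not the architecture under design: EventBus, the 'enrollment.created' event name, and the payload are all hypothetical, and a production system would more likely put a queue or message broker between the emitter and the handler.

```php
<?php
/** Hypothetical in-process event bus, used here only to show the control flow. */
final class EventBus
{
    /** @var array<string, callable[]> event name => registered handlers */
    private array $handlers = [];

    public function on(string $event, callable $handler): void
    {
        $this->handlers[$event][] = $handler;
    }

    public function emit(string $event, array $payload): void
    {
        // Handlers run as soon as the event fires; nothing waits for a schedule.
        foreach ($this->handlers[$event] ?? [] as $handler) {
            $handler($payload);
        }
    }
}

$bus = new EventBus();
$processed = [];

// The handler runs when the enrollment is created, not on the next scheduled poll.
$bus->on('enrollment.created', function (array $enrollment) use (&$processed): void {
    $processed[] = $enrollment['id'];
});

$bus->emit('enrollment.created', ['id' => 42]);
```

In this shape there is no backlog to scan: each enrollment is handled once, at the moment it appears.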

The legacy crons will keep running. The new systems are being built differently.