A few months ago I volunteered for a rotation on our Applied ML team to help prepare some of Seeq’s beta add-ons for general availability. Compared to the main Seeq platform, add-on codebases are much smaller, weighing in at thousands of lines of code apiece rather than more than a million, but the ML team is also much smaller than the core dev team and made up of industry practitioners, academics, and self-taught programmers, so they wanted a software developer’s perspective on how to make the code maintainable long-term. We decided the best place to start was the QSearch add-on.
Exploring the repo, I found lots of places where code could be more idiomatic, but it seemed like a waste of the team’s time to review and test superficial syntax changes. I wanted to find something more impactful, so I looked for files with lots of code, analyzed the Git history to find the hotspots that tended to churn, and checked cyclomatic complexity. One function in particular stood out in all three metrics: a function named `create_qsearch` that contained the core algorithm. Discussing with the team, we decided to focus attention there.
(As an aside, I didn’t totally avoid making superficial syntax changes. I’ve given in to a compulsion to polish code often enough that simply knowing there were spots to clean up became a nagging distraction. To deal with it, I created a `cleanup` branch specifically for small, unimportant changes. Anytime I found something minor to fix, I put it on that branch. I don’t intend to put up a PR with all those tiny changes, but just knowing that they’ve been fixed somewhere, even if not shipped, quieted my nagging brain and let it attend to the main effort without distraction.)
At the outset, `create_qsearch` was 390 lines of code and 79 logical branches with zero unit tests. The lack of unit tests was due to a dependence on a running instance of Seeq, either locally or on a remote server. One of my goals became to tease that dependence out of `create_qsearch` so tests could more easily protect against regressions. To do that, though, a lot of incidental complexity needed to be moved out of the way.
The biggest contributor to the high cyclomatic complexity was an unusual pattern of error handling.
`create_qsearch` took a flag that specified whether to throw errors or return partial results, but that meant whenever an error could occur there was an `if` statement like this:

```python
if raise_errors:
    raise CustomError('Something bad happened')
else:
    return dict_of_accumulated_results
```
To simplify this pattern, I added a `results` field to `CustomError`, allowing errors to be thrown with

```python
raise CustomError('something bad happened', results=dict_of_accumulated_results)
```

Then, instead of passing `raise_errors=False` to suppress errors, call sites could use a regular `except` and get partial results off the caught exception. Those changes alone cut the cyclomatic complexity nearly in half and began to bring greater clarity to the inherent complexity of the function.
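To make the pattern concrete, here’s a minimal sketch of an exception that carries partial results. This isn’t the actual QSearch code; `run_analysis` and the failure condition are hypothetical stand-ins.

```python
# Hypothetical sketch: an exception that carries whatever was computed before
# the failure, so callers opt in to partial results with a plain try/except
# instead of a raise_errors flag threaded through the whole function.
class CustomError(Exception):
    def __init__(self, message, results=None):
        super().__init__(message)
        self.results = results if results is not None else {}

def run_analysis(items):
    results = {}
    for item in items:
        if item < 0:  # stand-in for a real failure condition
            raise CustomError(f'bad item: {item}', results=results)
        results[item] = item * 2
    return results

try:
    partial = run_analysis([1, 2, -3, 4])
except CustomError as err:
    partial = err.results  # recover everything computed before the failure

print(partial)  # {1: 2, 2: 4}
```

Callers that want fail-fast behavior simply don’t catch the exception, so the flag and its branching disappear from the core function.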
My next steps faltered. When I tried moving validation and logging setup out of the core function into custom Python decorators, it became clear the ML team wasn’t comfortable with that. They worried about maintaining functions whose actual signatures were obscured behind decorators that modified the decorated functions’ arguments, either by adding arguments or by consuming arguments they handled rather than passing down, and they were right. Just because you can doesn’t mean you should. The decorators were also one-off functions that weren’t useful elsewhere and were harder to test. So I backed up and implemented the logic with a more traditional, class-based approach.
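The class-based alternative might look something like the sketch below, where validation and logging live in an explicit helper object so the core function’s real signature stays visible. The names here (`QSearchRunner`, `core_calculation`) are hypothetical, not the team’s actual classes.

```python
# Hypothetical sketch of the class-based approach: instead of decorators that
# silently add or consume arguments, an explicit runner object does the
# validation and logging, then calls the core function with its true signature.
import logging

class QSearchRunner:
    def __init__(self, logger=None):
        self.logger = logger or logging.getLogger('qsearch')

    def validate(self, params):
        if 'signal' not in params:
            raise ValueError('signal is required')

    def run(self, params):
        self.validate(params)
        self.logger.info('starting qsearch with %s', sorted(params))
        return core_calculation(params)  # nothing hidden behind a wrapper

def core_calculation(params):
    return {'signal': params['signal'], 'status': 'ok'}

result = QSearchRunner().run({'signal': 'temperature'})
```

The helper class is easy to unit test on its own, and reading `core_calculation`’s definition tells you exactly what it accepts.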
Within the inherent complexity of `create_qsearch`, a lot of back and forth with Seeq’s API remained. Unfortunately, almost all of that I/O wasn’t in `create_qsearch` itself but in methods on objects that `create_qsearch` either constructed or was given. To move that I/O out, I first needed to find it. I created a unit test that ran through the whole function without a connection to Seeq. Whenever a connection error occurred, I traced it back to a function or method invocation in `create_qsearch`, then moved the invocation to an outside function that could be passed in as a keyword argument. In the test, I passed a mocked function and ran the test again to repeat the process and find the next invocation. Eventually we moved the injected functions to a service class, and in the tests we passed a mocking class instead of the production service.
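The end state of that process looks roughly like this sketch: the only I/O in the core code goes through an injected service, and tests substitute a mock. `SeeqService`, `fetch_signal`, and `core_step` are illustrative names, not the real API.

```python
# Hypothetical sketch of the injection step: all I/O sits behind a service
# object passed into the core code, so tests can run with no Seeq server.
class SeeqService:
    def fetch_signal(self, name):
        raise NotImplementedError('talks to a live Seeq server in production')

class MockSeeqService:
    def fetch_signal(self, name):
        return [1.0, 2.0, 3.0]  # canned data for tests

def core_step(signal_name, service):
    data = service.fetch_signal(signal_name)  # the only I/O, now injected
    return sum(data) / len(data)              # pure calculation on the result

mean = core_step('temperature', MockSeeqService())
print(mean)  # 2.0
```

Because `core_step` never constructs its own connection, the test doesn’t need patching tricks; it just passes a different object.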
Now that we had an overarching unit test of the actual functionality, I started moving small chunks of code to their own methods, writing tests of the logical branches in those methods. That covered many more edge cases than was feasible with unit tests of `create_qsearch` alone. The chunks I pulled into methods relied on only two or three variables from `create_qsearch`’s scope and were calculations on those variables, not I/O calls. I also moved statements down as far as possible, deferring their execution until they were needed, which put related code closer together. Massaging the code like this exposed areas of common logic that varied slightly but were fundamentally the same and could be unified, further simplifying the function. Slowly, `create_qsearch` became a smaller, simpler, more tested function.
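As an illustration of that extraction step, a chunk that depends on only a couple of local variables becomes its own pure function, and each logical branch gets a direct test. `clamp_threshold` is a made-up example, not code from the add-on.

```python
# Hypothetical example: a few inline lines that depended on just two local
# variables, pulled out into a pure function with no I/O.
def clamp_threshold(threshold, maximum):
    if threshold < 0:
        return 0
    if threshold > maximum:
        return maximum
    return threshold

# each logical branch is now trivially unit-testable on its own
assert clamp_threshold(-1, 10) == 0
assert clamp_threshold(15, 10) == 10
assert clamp_threshold(5, 10) == 5
```

Testing three branches here takes three one-line assertions; exercising the same branches through the full 390-line function would require three elaborate end-to-end setups.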
At this point the QSearch lead began to see the possibilities of the emerging structure. Decoupling I/O from calculations meant team members could work on certain areas without having to understand the add-on as a whole. One developer could focus on the statistical logic without needing to understand how data was pulled from Seeq or how to push results to Seeq when done, while another could develop the interface to Seeq with no awareness of what number crunching was taking place. Separating concerns would be a big step in maintainability, and the lead developer began talking about other places where I/O could be pulled out of calculations and planning to do it himself.
To finish the refactor, I needed to fetch any necessary information from Seeq before `create_qsearch` was invoked, which would finally make `create_qsearch` a pure calculation. All I/O needed to be pulled up and out of `create_qsearch`. I floated any service calls as high in the function as I could, just below any calls whose results they relied upon, then in turn floated those calls up, until all the I/O occurred at the top of the function and could be peeled out into a wrapper function. As a very simple example, code structured like this:

```python
def create_qsearch(*args, **kwargs):
    a = calculation()
    b = fetch_data()
    return another_calculation(a, b)

def wrapper(*args, **kwargs):
    return create_qsearch(*args, **kwargs)
```
turned into something more like this:
```python
def create_qsearch(*args, b, **kwargs):
    a = calculation()
    return another_calculation(a, b)

def wrapper(*args, **kwargs):
    b = fetch_data()
    return create_qsearch(*args, b=b, **kwargs)
```
Aside from reworking exception handling, the code didn’t disappear so much as move to new places. Most of the refactoring was about decomposing logic into isolated units rather than improving the original metrics. Early in the process we discussed breaking up `create_qsearch` into separate methods for each phase, but I cautioned against it. While splitting the code would have immediately improved code length and cyclomatic complexity, it would have made it harder to manipulate the code and decouple unrelated concerns. Metrics pointed out where to work, not what to optimize. If we’d split the logic into separate methods, it would have been more difficult to see the interrelationships between different parts of the code, and floating I/O calls up and out of `create_qsearch` would have first required floating them up and peeling them out of whatever sub-method they’d been split into. A flat method was much easier to work with. Even with the main body of logic kept in a single function, the core calculation shrank from 390 lines to 90, and from 79 logical branches to eight. Still big, but much more maintainable.
For all the refactoring I’ve done over the years, most of it has been intuitive rather than methodical. Because I was working with programmers who didn’t always speak the same language as long-time software engineers, I had to articulate my intuition in greater detail. This kind of clarity of thought is one of the side benefits of cross-functional teams bringing together different perspectives and communication styles.
In summary, the key lessons I took from this experience are:
- use code metrics to find the place that needs work, but be cautious about optimizing (i.e., gaming) those metrics
- limit focus to just the spot that needs help, going for depth rather than breadth
- refactor away the biggest source(s) of incidental complexity so you can see the core logic clearly
- isolate I/O and side-effects into injected functions
- pull I/O and side-effects up and out to create pure(r) functions
- work iteratively: usually the first thought isn’t the best thought