Thoughts on scalability

2017-03-22

I have worked in three different organizations that rapidly grew to at least 10 times their original number of employees, and scalability became a huge problem each time.

I formed many ideas over the years about how things could have been done better. Everything I’ve listed here has a focus on improving productivity so that having more people doesn’t have to mean growing inefficiency. I feel that they could be taken one at a time, but in some cases there is a lot of cross-dependency. They are not in any particular order of importance.

Curated documentation

It’s easy to say you need more documentation. And in certain areas there is a lot of it, but people aren’t happy with it. Much of it is disorganized, hard to find, hard to search, poorly written, orphaned, out of date, or only understandable to the author. Just like with customer-facing documentation, internal documentation needs curation.

It would be of great use to have a dedicated person (or team) to provide organizational structure to your documentation, edit content for clarity and uniformity, solicit new content, drive updates, make it easy to search, and evangelize the heck out of it. As the organization grows, the value of that curation grows as well.

See How to easily create documentation.

New employee documentation

Every organization I’ve been in has new employee documentation on their wiki. But it’s rarely very good. It’s not that the author has done a poor job, but that it’s really hard to document something long after you went through it yourself. If there were a curator for documentation, they could talk to new employees and document what they struggled with, starting with day one. New people soon outnumber existing people in a growing organization, and great documentation will bring new people up to speed faster without consuming too much time from everyone else.

Encourage resourcefulness

It’s really easy to ask your neighbor, a chat room, or a large mailing list for the answer to a question. But there are available resources that people should be encouraged to use (documentation!). Instead of always providing answers, point people to relevant documentation. They will learn the answer and also a potentially new source of information. It’s a much better use of everyone’s time if resourcefulness is encouraged by co-workers and management. It’s too easy on a project to have senior people spending more time than necessary with new recruits. As the organization scales, the senior people end up doing far less engineering.

Who does what

It’s very difficult at large companies to find out who does what. If you see a name in email or in a bug report and look them up, it’ll say something very generic. If you knew what they worked on, you could have a much more efficient conversation with them that avoids a lot of back and forth.

If you want to know who works on a given team or in a specific role, you often can’t find out. You have to dig around or ask people or just give up. Even knowing the directly responsible individuals (DRIs) for each product would improve efficiency greatly. When an organization is small, it’s easy to keep this information in your head. But this quickly runs into scalability problems and makes things less productive, especially for new employees who must build a mental model of who does what themselves.

Status reports

Related to knowing who does what is something very useful I’ve seen used in the past. Everyone would post a status report every 2 weeks to a web page where everyone else could see it. It could be a few sentences on what you did in the last 2 weeks and what you are working on now.

As the organization grows, it becomes nearly impossible to be aware of anything outside of your small area. It’s useful to see the status of a person or a team that you care about or need to work with. I believe this could also facilitate cross-functional projects or even spark some new ideas. You can’t know everything a large organization does but if there is one area that you’d like to know about, it would be useful to have that information at your fingertips.

Code review

Many people would hate code review. However, this seemingly time-consuming activity can be a net positive in my mind. By having at least one other person look at code being committed, you have at least one more person aware of the changes should the main author be unavailable. Also, as a reviewer, you learn something about the project you didn’t know.

Code review also broadens the knowledge base of fellow engineers, which could lead to more efficiency if feature work or bug fixing needs to be shared. Also, it just leads to better code if someone has to sign off on it and have their name associated with it. It’s not so much about finger-pointing as about making sure new changes are taken more seriously.

Avoid acronyms

Acronyms can be a nice shortcut, but they also waste an inordinate amount of people’s time. Not a week goes by that you don’t run across a new acronym you don’t understand. I expect that every big company has a web page devoted to translating acronyms. I’ve been at some that have multiple competing pages.

While a team may understand its own acronyms, other teams have ones you won’t understand. There are some cases (like QA or UI/UX) where the acronym is almost universally recognized, so they can be useful shortcuts.

Elon Musk has an awesome diatribe on this topic called Acronyms Seriously Suck. Note that this is my only endorsement of something Elon Musk has said.

Avoid codenames

Once a product is shipped, avoiding codenames as much as possible can be a huge time-saver. With more new employees coming in than old ones, most people don’t know what these codenames mean. I see so many people keep their own cheat sheets of codenames, and it’s a time waster, similar to acronyms.

Root causing bugs

When a new bug report comes in, try to put aside time for root cause analysis, and then comment in the bug. It can be as simple as a quick sentence or a theory or a pointer to a specific area of the project where the bug lies. When the time comes to try to fix this bug, this initial analysis could be very useful, especially in cases where someone else or another team ends up owning the bug. It’s also incredibly useful to have this information when reviewing bugs to gauge the seriousness of a bug, how much effort it would take to fix, or the riskiness of meddling with the code in that area.

Also, the bug filer watching the bug will feel that the bug was taken seriously and may be more cooperative. Others can learn more about the product by seeing the analysis.

Dedicated tools team

Every organization should have a tools person and eventually a tools team as it gets large enough. Sometimes teams have their own needs and write their own specialized tools, but other times there are tools that would benefit an entire organization. Some great tools can start as a side project of one person and remain that way, where they fight for that person’s time against their regular job.

When someone leaves a team or the company, the tool gets orphaned. A tools team could proactively seek out the needs of the organization and try to address them. Useful tools that are rough around the edges could be handed off by an individual engineer to a tools team to develop further.

Integrated bug report filing

If someone finds a bad bug, we want them to file it as painlessly as possible. Getting bugs quickly after a build goes out makes for quicker bug fixing. Integrating a bug reporting feature into your product that is quick and efficient, while gathering all the necessary logs and supporting files, is essential as the organization gets larger.

Embedded QA is better

I’ve worked in organizations with 3 different setups: QA working directly on engineering teams, a separate QA organization, and a separate QA organization co-located with the engineering team.

For me, embedded QA wins hands down. This greatly shortens the feedback loop to engineering, which saves an extraordinary amount of time. Having quick access to engineers and staying more informed about the day-to-day engineering facilitates that shorter turnaround time. If an issue can be caught hours rather than weeks later, an engineer can fix it significantly faster. As an engineer switches contexts, it’s easier for them to fix things if they are still in that context or haven’t spent too much time away from it.

QA feedback into feature process

QA engineers should have a voice or be otherwise involved in the feature planning because of their domain knowledge about what people will struggle with. They see a broader set of detailed user issues than any engineer, manager, or director will ever see. They can provide great feedback that can hopefully lead to products that will have fewer usability issues, which saves time for a growing engineering organization.

Mailing lists

Scalability and mailing lists don’t belong in the same sentence. The efficiency of email declines rapidly as an organization grows. With 10,000+ people on a mailing list, the amount of time wasted on finding that one useful message is very high. Because of the message volume, those that have the most domain knowledge typically tune out, leaving only those that need help and aren’t getting it.

Some sort of forum-based solution (with reputation functionality) or other alternative should be explored as mailing lists break down. Also, except for “announce” mailing lists, everything else should be opt-in whenever possible. Forcing people to be inundated with email just makes them filter it out and ignore it. Opting in cuts down traffic and attracts only those that really have something to contribute.

Track bugs in a real database

As an organization grows, you simply can’t keep all the issues in your head or on sticky notes or on a wiki. You need to know what the issues are and what their status is without needing to dig around to find it. You need the audit trail and the history. Without it, you are disorganized, waste time fixing the wrong bugs, and can’t plan effectively for new releases.

Regular bug review

I think every team should have a bug review at least weekly, not just near the end of a release cycle. Every engineer’s list of bugs must be kept as lean and mean as possible, only containing bugs that reflect reality and have a reasonable chance of being addressed in the release they are tagged with. This means aggressive review of old bugs and getting rid of ones that you realistically will never address. With a hundred bugs instead of a thousand per engineer, they reasonably could review their queue every few months, which means less time needed for bug reviews and fewer search results polluted with irrelevant bugs.
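
As a rough illustration of the "aggressive review of old bugs" idea, here is a minimal Python sketch that picks out bugs nobody has touched in a long time so they can be reviewed or closed. The Bug fields, the stale_bugs helper, and the 180-day threshold are all assumptions made up for this example; a real bug tracker will have its own schema and query interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Bug:
    # Hypothetical fields; a real tracker will have its own schema.
    bug_id: int
    title: str
    last_updated: date
    target_release: str


def stale_bugs(bugs, today, max_age_days=180):
    """Return bugs untouched for more than max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [bug for bug in bugs if bug.last_updated < cutoff]
    return sorted(stale, key=lambda bug: bug.last_updated)
```

A list like this could be the starting agenda for a weekly review: each stale bug either gets a realistic release target or gets closed.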

Bug verification

People that file bugs should get the bugs assigned to them after they are fixed. Then they can verify each one is truly fixed given their setup.

Failing to verify these bugs may mean you ship with them unaddressed. You can’t afford to have thousands of bugs sitting around in Verify, wondering if they are truly fixed or not. This means more work doing emergency or software updates to fix things that actually didn’t get fixed on the first attempt.

Pre-submission testing

Nothing should be submitted to a build without some sort of pre-submission testing. At the very least, some minimal testing in areas that are part of that night’s submission. This is another case where the feedback loop can be shortened between a commit and reporting a bug to the engineer, improving efficiency. Don’t wait for a dozen people to hit the bug the next day. Let your one QA person find it today.

Per-commit testing

Even one step deeper than pre-submission testing is per-commit testing. It’s hard to achieve, but if there is sufficient staffing to allow a tester to watch commits, they could quickly identify ones they think could have side effects, and immediate feedback could be given. Imagine reporting a bug to an engineer a few minutes after they committed the code. They might even have the source file still open, and they can fix it much more easily than days later.

Testable builds for QA

This goes along with pre-submission testing. The minimum quality level that we need to shoot for is a build that is testable by QA. It can have lots of warts, but if QA can’t do basic functionality testing at all, that’s a tremendous waste of their resources. If this situation persists, suddenly they have a large set of tests to run but much less time before convergence ends. As complexity grows, QA needs to maximize their testing time.

Log gathering

Make gathering logs incredibly simple for bug filers, knowing that some may be less technical people. Some options for logging:

On by default if they don’t interfere too much with the program
Ability to turn on within the app itself
Ability to turn on in an integrated bug reporting UI
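
To make the last two options a bit more concrete, here is a minimal Python sketch of a runtime logging switch plus a helper that zips up log files for attaching to a bug report. The logger name "myapp", the function names, and the assumption that logs live as *.log files in a single directory are all hypothetical; a real product would wire this into its own settings and bug reporting UI.

```python
import logging
import zipfile
from pathlib import Path

logger = logging.getLogger("myapp")  # hypothetical application logger name


def set_verbose_logging(enabled):
    """Toggle detailed logging at runtime, e.g. from a settings switch."""
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)


def bundle_logs(log_dir, archive_path):
    """Zip every *.log file in log_dir so it can be attached to a bug report.

    Returns the names of the files that were added to the archive.
    """
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for log_file in sorted(Path(log_dir).glob("*.log")):
            archive.write(log_file, arcname=log_file.name)
            names.append(log_file.name)
    return names
```

The point of the sketch is that the bug filer never has to know where the logs live: one switch turns them on, and one button collects them.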

Bug report titles

When I did screening of bug reports, I would change the title of every bug I screened. I feel it’s of enormous benefit not only to the screener trying to find bugs later, but also to anyone searching for bugs, and especially in bug reviews.

I write them like a headline for the bug and move any metadata to other fields or remove it if not relevant. When you look at a list of bugs, you can browse through them quickly, and just read titles to jog your memory. It can be a great time saver all around. As bug volumes grow, there is less time to dig through the data overload within the bug itself, so this can help a lot with scalability.

Public relations for teams

I feel teams should care about how they are perceived by other teams, which can facilitate working together more productively. People will be more cooperative when filing bug reports if they believe the team will take them seriously.

Some things you can do as a team:

Review incoming bugs quickly, comment in them or, better yet, root cause them
Pay attention to mailing lists where people may discuss your product and respond promptly
Set up a team website/wiki with any relevant information about your team and your product.

As an organization gets bigger, you’re less likely to know all the teams you may be involved with. We can work better together if we have a respect for what other teams are doing and for their work ethic.

Efficient use of testing resources

This is pretty well covered in My hierarchy of testing. Basically, improving testing practices can have a profound impact on the productivity of an entire company.

Automation

There is often a push for more automation, but frequently it can lead to more time being spent maintaining an automation suite than actually using it. Automation needs to find enough important bugs to justify the effort put into it.

QA engineers that are the best at creating automation suites are often the best at finding bugs using test plans or ad hoc testing, so automation can take them away from higher value testing. A lot more care needs to be given to select things that have proven to be useful when automated, like performance testing, data correctness, and cases where there are a tremendous number of variations in a model. Manually performing some tests might be tedious, but you may also catch things that an automated test won’t. Human eyes are most often better than UI Automation.

See My hierarchy of testing.

Overzealous secrecy

Companies often go overboard on security at the expense of allowing for adequate testing in the real world. While people need to understand the need for secrecy, they also need to be trusted to make intelligent decisions on how to protect the company’s intellectual property. Don’t just default to “don’t use this in public” without thinking of alternatives that still allow some real-world testing.

The end.

Just admit QA was right