Tuesday, September 29, 2020

Conducting A/B Testing: A Walkthrough

The team invests a great deal of work, effort, and resources into every change in the game: developing a new feature or level can take several months. The analyst's task is to minimize the risks of shipping such changes and to help the team make the right decisions about the project's further development.

When evaluating decisions, it is important to be guided by statistically significant data that reflects the audience's preferences, rather than by intuition. A/B testing helps to obtain and evaluate such data. In this article, I will share my personal best practices: I will walk through each step of an A/B test, highlight the difficulties and pitfalls you may encounter, and describe how I have dealt with them.

6 “easy” steps of A/B testing

Search for "A/B testing" or "split testing" and most sources will offer several "simple" steps for a successful test. In my strategy there are six.

At first glance, everything is simple:

there is group A, the control, with no changes in the game;

there is group B, the test, with changes: for example, new functionality has been added, the difficulty of the levels has been increased, or the tutorial has been changed;

you run the test and see which variant performs better.

In practice, it is harder. For the team to implement the best solution, I, as an analyst, need to say how confident I am in the test results. Let's work through the difficulties step by step.

Step 1. Define the goal

On the one hand, we can test anything that comes to any team member's mind, from the color of a button to the difficulty of the game's levels. The technical ability to run split tests is built into our products at the design stage.

On the other hand, it is important to prioritize all proposals for improving the game by their expected effect on the target metric. So we first draw up a launch plan, ordering split tests from the highest-priority hypothesis to the lowest.

We try not to run multiple A/B tests in parallel, so that we know exactly which new feature affected the target metric. It may seem that this strategy takes longer to test all the hypotheses, but prioritization helps cut off unpromising hypotheses at the planning stage. We get data that reflects the effect of specific changes as accurately as possible, and we do not waste time setting up tests with questionable impact.

We always discuss the launch plan with the team, since the focus of interest shifts across the stages of the product life cycle. At the start of a project it is usually Retention D1, the percentage of players who return to the game the day after installing it. At later stages it can be other retention or monetization metrics: Conversion, ARPU, and so on.
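For reference, here is a minimal sketch of how Retention D1 can be computed from a raw event log with pandas; the tiny inline dataset and the column names are hypothetical, purely for illustration.

```python
import pandas as pd

# Hypothetical event log: one row per player per active day.
events = pd.DataFrame({
    "player_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime([
        "2020-09-01", "2020-09-02",                # player 1 came back on day 1
        "2020-09-01",                              # player 2 never returned
        "2020-09-01", "2020-09-02", "2020-09-03",  # player 3 came back on day 1
    ]),
})

# Install date = first day a player appears in the log.
installs = events.groupby("player_id")["date"].min().rename("install_date")

# Players seen exactly one day after their install date.
df = events.join(installs, on="player_id")
returned = df.loc[df["date"] == df["install_date"] + pd.Timedelta(days=1),
                  "player_id"].nunique()

print(f"Retention D1: {returned / len(installs):.0%}")  # 2 of 3 players -> 67%
```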

Example. Retention metrics require special attention after a project enters soft launch. Suppose we run into a typical problem at this stage: Retention D1 falls short of the company's benchmark for the game's genre. We then need to analyze the funnel of the first levels. Let's say we notice a large drop-off between the start of the game and completion of the 3rd level, that is, a low Completion Rate for level 3.

The goal of the proposed A/B test: increase Retention D1 by increasing the share of players who successfully complete the third level.
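The funnel analysis itself can be as simple as counting, level by level, the share of new players who got past each level. A minimal pandas sketch with made-up numbers (the drop at level 3 is deliberately exaggerated):

```python
import pandas as pd

# Hypothetical snapshot: the furthest level each new player has completed.
players = pd.DataFrame({
    "player_id": range(1, 11),
    "max_level_completed": [0, 1, 2, 2, 2, 3, 3, 4, 5, 5],
})

# Share of all new players who completed each of the first five levels.
for level in range(1, 6):
    rate = (players["max_level_completed"] >= level).mean()
    print(f"Level {level} completion rate: {rate:.0%}")

# The sharp drop between level 2 (80%) and level 3 (50%) flags the problem level.
```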

Step 2. Define the metrics

Before starting the A/B test, we choose the parameter to monitor: the metric whose change will show whether the game's new functionality is more successful than the original.

There are two types of metrics:

quantitative - average session duration, average purchase value, time to complete a level, amount of experience earned, and so on;

qualitative (share-based) - Retention, Conversion Rate, and others.

The type of metric affects the choice of method and tools for assessing the significance of results.
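For example, a quantitative metric such as session duration is typically compared with a two-sample test on the raw values. A sketch with SciPy on synthetic data (the exponential distributions are an assumption for illustration, not real session data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic session durations in minutes; real data would come from game logs.
sessions_a = rng.exponential(scale=8.0, size=1000)  # control
sessions_b = rng.exponential(scale=8.5, size=1000)  # test

# Welch's t-test compares the means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(sessions_a, sessions_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# For heavily skewed distributions a rank-based test is a safer choice.
u_stat, p_mw = stats.mannwhitneyu(sessions_a, sessions_b, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_mw:.3f}")
```

Qualitative (share-based) metrics call for tests on proportions instead; an example follows in Step 3.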

The tested functionality will likely affect not just the target metric but a number of metrics. So we look at the overall picture of changes, but we do not go hunting for "something significant" when the target metric itself shows no statistically significant change.

Following the goal from the first step, in the upcoming A/B test we will evaluate the Completion Rate of the 3rd level, a qualitative metric.

Step 3. Formulate a hypothesis

Each A/B test checks one general hypothesis, which is formulated before launch. We answer the question: what change do we expect in the test group? The wording usually follows the pattern: "We expect that change X will lead to such-and-such a change in metric Y."

Statistical methods work from the opposite direction: we cannot use them to prove that a hypothesis is correct. Therefore, after formulating the general hypothesis, we define two statistical ones. They help us understand whether the observed difference between control group A and test group B is an accident or the result of the changes.

In our example:

The null hypothesis (H0): reducing the difficulty of the 3rd level will not affect the proportion of users who successfully complete it. The Level 3 Completion Rate does not really differ between groups A and B, and any observed differences are random.

The alternative hypothesis (H1): reducing the difficulty of the 3rd level will increase the proportion of users who successfully complete it. The Level 3 Completion Rate is higher in group B than in group A, and the difference is the result of the changes.
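To illustrate, here is how this pair of hypotheses could be checked with a two-proportion z-test from statsmodels; the counts below are hypothetical, and the one-sided alternative matches H1 (group B higher):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: players who started level 3 and players who completed it.
completed = [850, 940]   # group A (control), group B (test)
started = [1000, 1000]

# H1 expects a higher rate in group B, i.e. prop(A) - prop(B) < 0,
# which corresponds to the one-sided alternative "smaller".
z_stat, p_value = proportions_ztest(completed, started, alternative="smaller")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, we reject H0: the difference is unlikely to be pure chance.
```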

At this stage, in addition to formulating the hypothesis, we need to estimate the expected effect.

Hypothesis: "We expect that reducing the difficulty of the 3rd level will increase its Completion Rate from 85% to 95%, that is, by 10 percentage points, or more than 11% in relative terms."

In this example, the expected Completion Rate of the 3rd level is chosen so as to bring it closer to the average Completion Rate of the starting levels.
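The expected effect also determines how many players each group needs: the smaller the expected uplift, the larger the required sample. A sketch of this power calculation with statsmodels, using the 85% to 95% effect from the hypothesis; the alpha = 0.05 and power = 0.8 values are conventional defaults I am assuming, not figures from the test itself:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size for the expected change in Completion Rate: 85% -> 95%.
effect = proportion_effectsize(0.95, 0.85)

# Sample size per group for a one-sided test, alpha = 0.05, power = 0.8.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.8,
    alternative="larger",
)
print(f"Roughly {n_per_group:.0f} players per group are needed")
```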

Step 4. Set up the experiment

1. We define the parameters of the A/B groups before starting the experiment: which audience the test targets, what share of players it covers, and what settings each group receives.
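One common way to implement such a split (not necessarily the one used here) is deterministic bucketing: hashing the player id together with the experiment name, so a player always lands in the same group and different experiments split independently. A sketch:

```python
import hashlib

def assign_group(player_id: str, experiment: str, test_share: float = 0.5) -> str:
    """Deterministically assign a player to group A (control) or B (test)."""
    # Salting the hash with the experiment name keeps different
    # experiments' splits independent of each other.
    digest = hashlib.md5(f"{experiment}:{player_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # pseudo-uniform in [0, 1)
    return "B" if bucket < test_share else "A"

# The same player always gets the same group, with no state to store.
print(assign_group("player-42", "level3_difficulty"))
```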

2. We check the representativeness of the sample as a whole and the homogeneity of the samples across the groups. To evaluate these, you can first run an A/A test, a test in which the test and control groups receive identical functionality. The A/A test helps to make sure there are no statistically significant differences in the target metrics between the groups. If there are differences, an A/B test with these settings (sample size and confidence level) cannot be run.

The sample will never be perfectly representative, but we always pay attention to the structure of users by their characteristics: new or returning user, level in the game, country. All of this is tied to the purpose of the A/B test and is agreed on in advance. What matters is that the user structure in each group is roughly the same.
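A rough way to verify this homogeneity is a chi-square test on the distribution of one such characteristic (say, country) across the two groups; the counts here are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical player counts by country in each group.
#                   US   DE   BR
counts = np.array([[500, 300, 200],   # group A
                   [510, 290, 205]])  # group B

chi2, p_value, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
# A large p-value gives no reason to think the group structures differ.
```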
