What on Earth is Simpson’s Paradox? How Does it Affect Data?

Simpson’s paradox is a phenomenon in probability and statistics, in which a trend appears in different groups of data, but disappears or reverses when these groups are combined.

Be careful when calculating averages or pooling data from different groups. Always check whether the pooled data tells the same story as the non-aggregated data. If the stories differ, Simpson’s paradox may be at work: a lurking variable is probably influencing the relationship between the explanatory variable and the target variable.

Let us understand Simpson’s paradox with the help of an example:

In 1973, a court case was registered against the University of California, Berkeley, alleging gender bias in graduate admissions. Here, we will use synthetic data to explain what really happened.

Let’s assume the combined data for admissions in all departments is as follows:

Gender    Applicants    Admitted    Admission Percentage
Men       2,691         1,392       52%
Women     1,835         789         43%

If you look at the data carefully, you’ll see that 52% of the men were admitted to the university, while only 43% of the women were. At first glance, the admissions favoured the men, and the women were not given their due. However, the case is not as simple as it appears from this information alone. Let’s now assume that there are two categories of departments: ‘Hard’ (hard to get into) and ‘Easy’.

Let’s divide the combined data into these categories and see what happens:

Department    Applied (Men)    Applied (Women)    Admitted (Men)    Admitted (Women)    Admission % (Men)    Admission % (Women)
Hard          780              1,266              200               336                 26%                  27%
Easy          1,911            569                1,192             453                 62%                  80%
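The percentages in both tables can be recomputed from the raw counts. The following is a minimal Python sketch, not part of the original analysis; the dictionary layout and the helper names `rate` and `pooled` are our own choices for illustration. It prints the admission rate per department and for the combined data, then checks whether the direction of the comparison flips.

```python
# Applicant and admission counts copied from the two tables above.
# Layout (an illustrative choice): counts[department][gender] = (applied, admitted)
counts = {
    "Hard": {"Men": (780, 200), "Women": (1266, 336)},
    "Easy": {"Men": (1911, 1192), "Women": (569, 453)},
}


def rate(applied, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied


def pooled(gender):
    """Sum applicants and admits for one gender across all departments."""
    applied = sum(dept[gender][0] for dept in counts.values())
    admitted = sum(dept[gender][1] for dept in counts.values())
    return applied, admitted


# Admission rates inside each department.
for name, dept in counts.items():
    men, women = rate(*dept["Men"]), rate(*dept["Women"])
    print(f"{name}: men {men:.1%}, women {women:.1%}")

# Admission rates after pooling the departments.
men_all, women_all = rate(*pooled("Men")), rate(*pooled("Women"))
print(f"Combined: men {men_all:.1%}, women {women_all:.1%}")

# True when every department favours women but the pooled data favours men.
reversal = (
    all(rate(*d["Women"]) > rate(*d["Men"]) for d in counts.values())
    and men_all > women_all
)
print("Reversal between grouped and pooled data:", reversal)
```

Running the same check on your own data before reporting a pooled average is a quick way to see whether the grouped and combined views agree.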

Do you see any gender bias here? In the ‘Easy’ department, 62% of the men and 80% of the women were admitted. Likewise, in the ‘Hard’ department, 26% of the men and 27% of the women were admitted. Is there any bias here? Yes, there is. But, interestingly, the bias is not in favour of the men; it favours the women! Combine the data, however, and an altogether different story emerges: a bias favouring the men becomes apparent. In statistics, this phenomenon is known as ‘Simpson’s paradox’. But why does this paradox occur?

Simpson’s paradox occurs when the effect of the explanatory variable on the target variable changes direction once you account for a lurking variable. In the above example, the lurking variable is the ‘department’. Most men (1,911 of 2,691, about 71%) applied to the ‘Easy’ department, while most women (1,266 of 1,835, about 69%) applied to the ‘Hard’ department, where far more applications were rejected. As a result, a larger share of the women’s applications was turned down overall. When the data is combined, it shows an apparent bias towards male admissions that does not exist within either department.
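One way to see the mechanism is to write each gender's combined admission rate as a weighted average of the department rates, where the weight is the share of that gender's applications sent to each department. The sketch below is illustrative only; the function name `decompose` is hypothetical, and the counts are the ones from the tables above.

```python
# Counts from the tables above: (applied, admitted) per department.
men = {"Hard": (780, 200), "Easy": (1911, 1192)}
women = {"Hard": (1266, 336), "Easy": (569, 453)}


def decompose(by_dept):
    """Return (share, rate) per department and the resulting pooled rate.

    share = fraction of this gender's applications sent to the department
    rate  = admission rate inside the department
    The pooled rate equals the sum of share * rate over departments.
    """
    total_applied = sum(applied for applied, _ in by_dept.values())
    parts = {
        dept: (applied / total_applied, admitted / applied)
        for dept, (applied, admitted) in by_dept.items()
    }
    pooled_rate = sum(share * r for share, r in parts.values())
    return parts, pooled_rate


for label, data in (("Men", men), ("Women", women)):
    parts, pooled_rate = decompose(data)
    for dept, (share, r) in parts.items():
        print(f"{label} / {dept}: {share:.0%} of applications, {r:.0%} admitted")
    print(f"{label} pooled admission rate: {pooled_rate:.0%}")
```

Because the women's weights lean towards the low-rate ‘Hard’ department and the men's weights lean towards the high-rate ‘Easy’ department, the women's weighted average ends up lower even though their rate is higher in each department.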

Now suppose you were a statistician for the Indian government, inspecting a fighter plane that had returned from the 1965 war. Looking at the bullet holes in the aircraft’s surface, what would you recommend? Would you recommend strengthening the areas hit by bullets?

The following is an excerpt from a StackExchange post:

“During World War II, Abraham Wald was a statistician for the U.S. government. He looked at the bombers that returned from missions and analysed the pattern of the bullet ‘wounds’ on the planes. He recommended that the Navy reinforce areas where the planes had no damage.

Why? We have selective effects at work. This sample suggests that damage inflicted on the observed areas could be withstood. Either the plane was never hit in the untouched areas — an unlikely proposition — or strikes to those parts were lethal. We care about the planes that went down, not just those that returned. Those that fell likely suffered an attack in a place that was untouched on those that survived.”
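The selection effect described in the excerpt can be reproduced with a small simulation. This is a toy sketch under made-up assumptions, not historical data: the four sections, the loss probabilities, the one-hit-per-plane rule, and the function names are all hypothetical and chosen only to make the effect visible.

```python
import random

# All numbers below are made-up assumptions for illustration, not historical data.
SECTIONS = ["engine", "fuselage", "wings", "tail"]
# Assumed chance that a single hit in a section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.6, "fuselage": 0.1, "wings": 0.1, "tail": 0.1}


def simulate(n_planes=10_000, seed=0):
    """Give each plane one hit in a uniformly random section.

    Returns two dicts of hit counts per section: one for every plane that
    flew, and one for only the planes that made it back.
    """
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    returned_hits = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        section = rng.choice(SECTIONS)
        all_hits[section] += 1
        if rng.random() >= LOSS_PROBABILITY[section]:  # plane survives
            returned_hits[section] += 1
    return all_hits, returned_hits


def share(hits, section):
    """Fraction of all recorded hits that landed in the given section."""
    return hits[section] / sum(hits.values())


all_hits, returned_hits = simulate()
print(f"Engine share of hits, all planes:      {share(all_hits, 'engine'):.1%}")
print(f"Engine share of hits, returned planes: {share(returned_hits, 'engine'):.1%}")
```

Under these assumptions, engine hits are under-represented among the planes that return, which is why an analyst who sees only the survivors could wrongly conclude that the engine is rarely hit.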

In statistics, things are not always as they appear on the surface. You need to be skeptical and look beyond the obvious during analysis. Maybe it’s time to read ‘Think Like a Freak’ or ‘How to Think Like Sherlock’. If you already have, let us know your thoughts!

Thulasiram Gunipati

Thulasiram is a veteran with 20 years of experience in production planning, supply chain management, quality assurance, information technology, and training. Trained in Data Analysis at IIIT Bangalore and UpGrad, he is passionate about education and operations and ardent about applying data analytics techniques to improve operational efficiency and effectiveness. He presently works as Program Associate for Data Analysis at UpGrad.