Recently I came across a discussion of an article about maternal mortality rates in the USA. One opinion voiced there was that, by this measure, the United States is far from being a developed country. Since the Gapminder data were an option for the data investigation project, I decided to explore the situation myself using the world data on maternal mortality ratio.
The following questions were posed and explored in the project:
Maternal mortality ratio is the number of maternal deaths divided by the number of live births in a given year, multiplied by 100,000. Maternal death is defined as the death of a woman while pregnant or within 42 days of termination of that pregnancy, regardless of the length and site of the pregnancy, from a cause related to or aggravated by the pregnancy.
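As a quick illustration of the definition above, the ratio can be computed directly from two counts. The function name and the numbers below are made up for the example and do not come from the data set.

```python
def maternal_mortality_ratio(maternal_deaths, live_births):
    """Maternal deaths per 100,000 live births (hypothetical helper)."""
    if live_births <= 0:
        raise ValueError("live_births must be positive")
    return maternal_deaths * 100_000 / live_births

# Made-up numbers: 50 maternal deaths among 250,000 live births
example_ratio = maternal_mortality_ratio(50, 250_000)
print(example_ratio)
```

Because the ratio is scaled to live births, it allows comparing countries of very different sizes.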
The data available included observations of maternal mortality ratio in 187 countries in 1800-2013. Since most countries have information only for specific years during 1980-2013, the exploratory analysis was limited to these years. The data were investigated from a time perspective and also in geographical and economic context. Other parameters of health economics and population were also added to look for possible correlations in the trends.
The data on maternal mortality were obtained from Gapminder.com, section Data/Health/Maternal health, as a .csv file [1]. Regional information was also downloaded in .csv format [2]. The classification of income groups for the corresponding years was obtained from the World Bank website as an .xls file and then truncated and converted to .csv in spreadsheet software [3].
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# Data source: https://www.gapminder.org/data/
# See section: Data/Health/Maternal health/Maternal Mortality
mm_data = pd.read_csv('maternal_mortality_ratio_per_100000_live_births.csv')
mm_data.info()
mm_data.head(15)
As the sample of the data above shows, most countries have no observations for many years (the zeros are not meaningful and stand for missing data). Counting the unique values in each column gives a rough idea of how many values other than zero each year contains.
# determine for which years observations are available for most of the countries
pd.options.display.max_rows = 250
mm_data.nunique()
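Note that `nunique` counts distinct values, so ties between countries and the zero itself both affect the result. A minimal sketch, on made-up toy data in the same wide layout, of counting the non-zero entries directly, which may be a closer measure of how many countries have an observation in a given year:

```python
import pandas as pd

# Toy data: one row per country, one column per year,
# zeros standing in for missing observations (values are made up)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    '1980': [0, 120, 120, 45],
    '1990': [90, 90, 90, 0],
})

year_cols = ['1980', '1990']
# Distinct values per column, zero included (what nunique reports)
distinct_values = toy[year_cols].nunique()
# Countries with an actual (non-zero) observation per column
observed = (toy[year_cols] != 0).sum()

print(pd.DataFrame({'distinct_values': distinct_values, 'observed': observed}))
```

In the toy 1990 column three countries are observed but share one value, so the two counts differ.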
The missing data for years earlier than 1980 limit the possibility of analysing the data from a deep historical perspective. However, for the last three decades there is enough data for global estimates and comparisons. The pattern in the data since 1980 suggests that a standardized procedure for regular data gathering was implemented starting from that year.
Still, it may be informative to take a look at the dynamics of the oldest data available. For this purpose we can use the list of the countries that have records starting from the beginning of the 20th century.
mm_data[mm_data['1900'] > 0][['country', '1900']]
countries_1900 = list(mm_data[mm_data['1900'] > 0]['country'])
# zeros are missing values, so they are replaced with NaN before plotting
plot = (mm_data[mm_data['country'].isin(countries_1900)]
        .set_index('country').replace(0, np.nan).T
        .plot.line(figsize = (15, 8), marker='.',
                   title='Maternal mortality in countries with oldest records'))
plot.set(xlabel="Years", ylabel="Maternal mortality ratio, per 100 000 live births")
plot;
For most of the years available, Sri Lanka showed much higher values than the other countries, which are all European except for the United States. The ratios in the USA were also higher than in the European countries in 1900-1935, but to a lesser extent. Around 1940-1950, however, a noticeable downward trend started in all six countries. Still, investigating global tendencies requires a greater number of countries. The data set lists 187 countries, so it seems reasonable to keep only the years with more than 100 observations (approximated below by the number of unique values per year).
enough_data = mm_data.nunique() > 100
enough_data[enough_data == True]
# determining countries with no data
(mm_data == 0).astype(int).sum(axis=1)
Many countries only have data for 7 of the 214 years included in the data set. That gives a limit of 207 "zero years" per row: a country that exceeds it has missing values in at least one of the years where most countries have observations, that is, in the years chosen for the study.
no_data_countries = (mm_data == 0).astype(int).sum(axis=1) > 207
ndc_index = list((no_data_countries[no_data_countries == True]).index)
ndc_index
mm_data.iloc[ndc_index, [0]]
For 7 countries included in the source file there is no information for most of these years; the same can easily be observed on Gapminder. In addition, only a few years contain observations for almost all countries. The data set can therefore be limited to the following seven years: 1980, 1990, 1995, 2000, 2005, 2010 and 2013, which can still properly represent the historical dynamics of recent decades.
For further exploration we need to subset the data to the years listed above. Since the zeros originate from missing data in the source file, it is reasonable to convert them to NaN and drop the countries that have any. Replacing zeros with NaN also converts the affected integer columns to floating point.
# columns to keep
# this list will also be used further for years as col_list[1:]
col_list = list(enough_data[enough_data == True].index)
mat_mort_100k_lbirths = mm_data[col_list]
mat_mort_100k_lbirths.info()
#handling missing values
mat_mort_100k_lbirths = mat_mort_100k_lbirths.replace(0, np.nan)
mat_mort_100k_lbirths.info()
mat_mort = mat_mort_100k_lbirths.dropna(axis=0, how='any')
mat_mort.describe()
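The two cleaning steps above can be seen in isolation on a small example. This is only a sketch on made-up values; the column names mimic the year columns of the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    '1980': [0, 300, 12],
    '1990': [250, 280, 10],
})

# Zeros mark missing observations, so turn them into NaN first...
cleaned = toy.replace(0, np.nan)
# ...then keep only the countries observed in every selected year
complete = cleaned.dropna(axis=0, how='any')

print(cleaned.dtypes)
print(complete)
```

The 1980 column becomes floating point because NaN cannot be stored in an integer column, which is the type conversion mentioned in the text.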
As the summary statistics show, the range of maternal mortality ratios narrowed from [5.8, 2120] in 1980 to [1, 1100] in 2013. The mean is well above the median, which lies much closer to the minimum, so the distribution is positively skewed. The IQR also decreased by 324.8, moving leftward at the same time. Visualising the global averages shows that the mean decreased gradually over time, while the median dropped most sharply in 1980-1990.
# numeric_only skips the 'country' column
plot = mat_mort.median(numeric_only=True).plot(title='Maternal Mortality: Global Median')
plot.set(xlabel="Years", ylabel="Maternal mortality ratio")
plot;
plot = mat_mort.mean(numeric_only=True).plot(title='Maternal Mortality: Global Average')
plot.set(xlabel="Years", ylabel="Maternal mortality ratio")
plot;
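The gap between mean and median noted above is typical for right-skewed data: a few very large values pull the mean up while the median stays with the bulk of the observations. A small sketch on made-up ratios (not taken from the data set):

```python
import pandas as pd

# Made-up, right-skewed sample of ratios
ratios = pd.Series([5, 8, 12, 20, 35, 60, 150, 900, 1500])

mean, median = ratios.mean(), ratios.median()
iqr = ratios.quantile(0.75) - ratios.quantile(0.25)
# Share of observations lying below the mean
share_below_mean = (ratios < mean).mean()

print(round(mean, 1), median, iqr, round(share_below_mean, 2), round(ratios.skew(), 2))
```

In such a sample well over half of the observations fall below the mean, so "below the world average" should not be read as "in the better half".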
The trends for the global mean and median look encouraging: the world definitely seems to be becoming a safer place for giving birth. The spread and other statistics of the maternal mortality ratio decreased during the period in question, most of them quite noticeably. However, world average indices are not very informative. More insights can be obtained from data structured by geographical or economic parameters. Gapminder provides several options of regional divisions in its geographical data.
#data source: https://www.gapminder.org/data/geo/
regions = pd.read_csv('list-of-countries-etc.csv')
regions.columns.values
The file contains several variables which won't be used in further analysis; these variables can be omitted. Strings representing factor variables also need to be converted to the categorical type. After that both dataframes can be merged on country names. The 'geo' column is kept for merging with the income groups data.
# subsetting regional variables
cols_to_keep = ['geo', 'name', 'four_regions', 'eight_regions', 'six_regions',]
# .copy() makes the subset an independent dataframe, so the columns can be reassigned safely
regions = regions[cols_to_keep].copy()
# turning string variables into factors
cols_to_factor = ['four_regions', 'eight_regions', 'six_regions']
for column in cols_to_factor:
    regions[column] = regions[column].astype('category')
regions.info()
#merging region info with maternal mortality data
mat_mort_regions = pd.merge(mat_mort, regions, how = 'left', left_on = 'country', right_on = 'name')
mat_mort_regions = mat_mort_regions.drop('name', axis=1)
mat_mort_regions.info()
The World Bank data on income groups on Gapminder include the classification for 2017 only, so the data for the earlier years were obtained directly from the World Bank website as an .xls file. The data available start from 1984, so for the purpose of this project the corresponding years were subset starting from 1990.
Country classifications are determined by the World Bank once a year and are based on estimates of gross national income (GNI) per capita for the previous year. The classification tables include all World Bank members, plus all other economies with populations of more than 30,000. [4]
#adding income data
wb_income = pd.read_csv('WB_income.csv')
wb_income.head(10)
Missing data in the file are represented by ".." and need to be replaced by NaN values. The names of countries may differ from those in the Gapminder data, so the 'code' column will be used for merging and the 'country' column may be excluded.
wb_income = wb_income.replace("..", np.nan)
wb_income = wb_income.drop(['country'], axis=1)
wb_income.head()
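As an alternative to replacing the placeholder after loading, `pd.read_csv` accepts an `na_values` argument that treats given strings as missing while parsing. A sketch on an in-memory file; the rows are made up and the layout is only assumed to resemble WB_income.csv.

```python
import io
import pandas as pd

# Made-up rows in the assumed layout of the income file
raw = io.StringIO(
    "code,country,1990,1995\n"
    "AAA,Country A,L,LM\n"
    "BBB,Country B,..,UM\n"
)

# na_values tells the parser which extra strings to read as NaN
toy_income = pd.read_csv(raw, na_values=['..']).drop(columns=['country'])
print(toy_income)
```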
The data represent the classification of income groups encoded with first letters:
H - High income
UM - Upper middle income
LM - Lower middle income
L - Low income
To estimate the change in countries' economic situations we can convert these group names to numeric ranks. This also will allow us to get a country's average performance during the period in question.
#setting numeric ranks
def income_to_rank(value):
    """
    The function returns numeric ranks for World Bank income group labels;
    missing values are passed through unchanged
    """
    val_dict = {"L": 1, "LM": 2, "UM": 3, "H": 4}
    if pd.isnull(value):
        return value
    return val_dict[value]

years = ['1990', '1995', '2000', '2005', '2010', '2013']
for year in years:
    col_name = year + '_inc_rank'
    wb_income[col_name] = wb_income[year].map(income_to_rank)
#getting average rank for 1990-2013
# selecting the rank columns by name rather than by position
wb_income['avg_inc_rank'] = wb_income[[year + '_inc_rank' for year in years]].mean(axis=1)
wb_income.head()
#getting change in ranking during 1990-2013
# each year is compared with the previous year in the list (5 years earlier, or 3 for 2013)
for prev_year, year in zip(years[:-1], years[1:]):
    col_name = year + '_rank_change'
    wb_income[col_name] = wb_income[year + '_inc_rank'] - wb_income[prev_year + '_inc_rank']
wb_income['rank_change_sum'] = wb_income[[year + '_rank_change' for year in years[1:]]].sum(axis=1)
wb_income.head()
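One property of `rank_change_sum` is worth keeping in mind: a sum of consecutive differences telescopes, so for a country with a rank in every year the total equals the 2013 rank minus the 1990 rank. When a year is missing, `sum(axis=1)` skips the NaN differences and the two quantities can diverge. A sketch on made-up ranks:

```python
import numpy as np
import pandas as pd

years = ['1990', '1995', '2000', '2005', '2010', '2013']
# Made-up ranks: the first row is complete, the second lacks a 1990 rank
ranks = pd.DataFrame(
    [[1, 1, 2, 2, 3, 3],
     [np.nan, 2, 2, 3, 3, 3]],
    columns=years, dtype=float,
)

# Differences between consecutive year columns, then their row-wise sum (NaN skipped)
changes = ranks.diff(axis=1).iloc[:, 1:]
rank_change_sum = changes.sum(axis=1)
# Direct comparison of the last and the first year
first_to_last = ranks['2013'] - ranks['1990']

print(pd.DataFrame({'rank_change_sum': rank_change_sum, 'first_to_last': first_to_last}))
```

For the second row the sum still reports the change observed since 1995, which may be what is wanted for countries that had no separate classification in 1990.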
#excluding auxiliary columns
wb_income = wb_income.drop(["1990_inc_rank", "1995_inc_rank", "2000_inc_rank",
"2005_inc_rank", "2010_inc_rank", "2013_inc_rank",
"1995_rank_change", "2000_rank_change",
"2005_rank_change", "2010_rank_change", "2013_rank_change"], axis=1)
wb_income.head()
# converting income group labels for readability
wb_income[years] = wb_income[years].replace('L', 'Low income')
wb_income[years] = wb_income[years].replace('LM', 'Lower middle income')
wb_income[years] = wb_income[years].replace('UM','Upper middle income')
wb_income[years] = wb_income[years].replace('H', 'High income')
for year in years:
    wb_income[year] = wb_income[year].astype('category')
#converting geocodes to lowercase for merging
wb_income['code'] = wb_income['code'].str.lower()
wb_income.head()
wb_income.info()
mat_mort_regions = pd.merge(mat_mort_regions, wb_income, how = 'left', left_on = 'geo', right_on = 'code', suffixes=('', '_income'))
mat_mort_regions = mat_mort_regions.drop(['geo', 'code'], axis=1)
mat_mort_regions.info()
# getting the list of countries having missing data in income classification for 1990
", ".join(list(mat_mort_regions[pd.isnull(mat_mort_regions['1990_income'])]['country'].sort_values()))
The prepared data frame contains country names, the maternal mortality ratio of each country for a list of years (1980, 1990, 1995, 2000, 2005, 2010, 2013), three options of regional classification, income classifications for the corresponding years (excluding 1980), average income ranks and the change in income rank over 1990-2013. The missing income data for 1990 refer mostly to countries which at that time were part of other states, such as the USSR, Yugoslavia and Czechoslovakia, which should be taken into account during exploratory analysis.
# world distribution
years = ['1980'] + years
plot = mat_mort_regions[years].boxplot(figsize = (10, 6))
plot.set(xlabel = 'Years', ylabel='Maternal mortality ratio in countries', title = 'Maternal mortality in the world')
plot;
Over 1980-2013 maternal mortality in the world decreased significantly. The distribution of observations became narrower and more concentrated below 100 cases per 100,000 live births.
def plot_many_hists(years, data, figsize = (9, 8)):
    """
    The function takes in a list of years, a dataframe with these years
    and (optional) a tuple with the dimensions of the plot and shows
    the plot with several distributions on it
    """
    plt.figure(figsize = figsize)
    for year in years:
        plt.hist(data[year], edgecolor='black',
                 range=(0,2500), bins = 2500//50, alpha=0.5, label=year)
    plt.title('Maternal Mortality Across The World')
    plt.xlabel('Maternal mortality ratio')
    plt.ylabel('Number of countries')
    plt.legend(loc='upper right')
    plt.show()
decades = ['1980', '1990', '2000']
plot_many_hists(decades, mat_mort_regions, (10, 8))
xxi = ['2000', '2013']
plot_many_hists(xxi, mat_mort_regions, (10, 8))
mat_mort_stats = mat_mort_regions[years].describe()
mat_mort_stats = mat_mort_stats.T
mat_mort_stats
mm_diff = (mat_mort_regions.describe()['2013'] - mat_mort_regions.describe()['1980'])/mat_mort_regions.describe()['1980']
mm_diff.loc[['min', 'max', 'mean', '25%', '50%', '75%']]
The mean mortality ratio over the world dropped from 362.3 in 1980 to 162.8 in 2013 (by 55.1%). The median decreased from 169.5 in 1980 to 64 in 2013 (by 62.2%). The mean decreased gradually over the whole period in question, while the median dropped most sharply in 1980-1990. This can be observed on the following chart.
plot = mat_mort_stats[['mean', '25%', '50%', '75%']].plot.line(title = "Summary Statistics of Maternal Mortality Over Time")
plot.set(xlabel = 'Years', ylabel = 'Maternal mortality ratio');
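The percentage drops quoted for the mean and the median follow from the usual relative change formula, (new - old) / old. A short check using only the figures given in the text; the helper name is made up for the example.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100

# Figures quoted in the text: global mean and median in 1980 and 2013
mean_change = pct_change(362.3, 162.8)
median_change = pct_change(169.5, 64)
print(round(mean_change, 1), round(median_change, 1))
```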
def below_world_avg_in_year(year, data):
    """
    Returns the world average for the year and the number and proportion of the countries
    whose maternal mortality ratio was below the world average in that year
    """
    year = str(year)
    world_avg = data[year].mean()
    below_avg = data[year] < world_avg
    n_countries = below_avg.sum()
    proportion = below_avg.mean()
    return (year, round(world_avg, 2), n_countries, round(proportion, 2))

for year in col_list[1:]:
    print(below_world_avg_in_year(year, mat_mort_regions))
# share of countries (in %) with a ratio above 1000 and below 100
print('year', '>1k', '<100')
n_total = mat_mort_regions['country'].count()
for year in col_list[1:]:
    print(year,
          round((mat_mort_regions[year] > 1000).sum()*100/n_total, 2),
          round((mat_mort_regions[year] < 100).sum()*100/n_total, 2))
The share of countries with maternal mortality ratios below the world average remained at 67-68% in 1990-2013 and was only a few percentage points lower in 1980 (63%). Meanwhile, the share of countries with a maternal mortality ratio below 100 increased from 37.8% in 1980 to 60% in 2013.
The following countries were on the top and bottom positions during the years in question:
# idxmax/idxmin return index labels, so .loc is used for the lookup
for year in col_list[1:]:
    print('Maximum:', mat_mort_regions.loc[mat_mort_regions[year].idxmax(), ['country', year]])
    print('Minimum:', mat_mort_regions.loc[mat_mort_regions[year].idxmin(), ['country', year]])
    print()
mat_mort_regions[mat_mort_regions['country'] == 'Sierra Leone']
mat_mort_regions[mat_mort_regions['country'] == 'Bhutan']
mat_mort_regions[mat_mort_regions['country'] == 'Sweden']
mat_mort_regions[mat_mort_regions['country'] == 'Canada']
mat_mort_regions[mat_mort_regions['country'] == 'Greece']
mat_mort_regions[mat_mort_regions['country'] == 'Italy']
mat_mort_regions[mat_mort_regions['country'] == 'Ireland']
mat_mort_regions[mat_mort_regions['country'] == 'Belarus']
The lowest ratios during 1980-2013 were observed in European countries. The highest ratio in 1980 was seen in Bhutan, but the country followed the world trend of decreasing maternal mortality during the following decades (alongside the improvement of its income ranking position). The new "leader", Sierra Leone, emerged in 1990 and showed quite different dynamics, with a further increase in 1995 (the peak corresponds to the years of the civil war in the country [5]). Sierra Leone was able to return to the level of 1980 only in 2010, getting back to the global trend after 2000.
The division into income groups is based on the classifications reported annually by the World Bank [3]. Of these data only the years corresponding to those in the maternal mortality data set were used. In the exploratory analysis of maternal mortality from the perspective of a country's economic position the following questions may be considered:
Since the data on income groups for 1980 are unavailable, the earliest year to consider is 1990. The subplots show four different patterns depending on the income group.
# distributions in different income groups
def make_box_plot(df, group, x_lab, y_lab, fig_size = (10,11)):
    """
    The function draws a set of labeled subplots for specific groups of a variable in a dataframe
    """
    sub_df = df.drop(['avg_inc_rank', 'rank_change_sum'], axis=1)
    plots = sub_df.groupby(group).boxplot(figsize = fig_size)
    for plot in plots:
        plot.set(xlabel = x_lab, ylabel = y_lab)

make_box_plot(mat_mort_regions, '1990_income', 'Years', 'Maternal mortality ratio')
# several columns are selected with a list (double brackets)
mat_mort_regions.groupby('1990_income')[['1980', '2010']].describe()
def get_change(group, data, year_1, year_2):
    """
    The function returns the change of mean and median in percentage
    for two given years, data split into categories of the group
    """
    year_1 = str(year_1)
    year_2 = str(year_2)
    group_year1 = data.groupby(group).describe()[year_1]
    group_year2 = data.groupby(group).describe()[year_2]
    change = (group_year2 - group_year1)*100/group_year1
    return change[['mean', '50%']]
# change in maternal mortality ratio in 2013 in comparison with 1980
get_change('1990_income', mat_mort_regions, 1980, 2013)
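The same kind of per-group percentage change can also be obtained by aggregating only the two statistics of interest. The sketch below uses made-up groups and values and is not a replacement for `get_change`, just an illustration of the calculation it performs.

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['High', 'High', 'Low', 'Low', 'Low'],
    '1980': [20.0, 40.0, 800.0, 1000.0, 1200.0],
    '2013': [10.0, 10.0, 500.0, 900.0, 1000.0],
})

# Mean and median per group for both years in one pass
stats = toy.groupby('group')[['1980', '2013']].agg(['mean', 'median'])
# Percentage change from 1980 to 2013 for each statistic
change = (stats['2013'] - stats['1980']) * 100 / stats['1980']
print(change.round(1))
```

A negative value means the statistic decreased; in the toy data the "Low" group improves less in relative terms even though its absolute drop is larger.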
As we can see from the plots and summary statistics, in high income countries the distribution of maternal mortality ratios was and remained quite narrow and close to 0. The number of outliers decreased over time. In the upper middle income group the range of the distribution was below 500 in 1980 and mostly fell below 250 by 2013. In the lower middle income group the range narrowed dramatically (from about 950 to 250 for its upper limit), though the number of outliers is still noticeable. In the low income group the distribution is the widest for each year. Though it tends to become narrower, the decrease is rather gradual. This group also had the smallest decrease in the mean and median of the maternal mortality ratio of all income groups.
However, the trends described above don't include the countries which didn't have a separate income rank in 1990. We can plot them separately.
make_box_plot(mat_mort_regions[pd.isnull(mat_mort_regions['1990_income'])],
'1995_income', 'Years', 'Maternal mortality ratio')
In 1990 most of the countries in this group were parts of unions that the World Bank listed in the upper middle income group [3]. Thus, though they underwent some economic difficulties, the maternal mortality ratios in the countries classified as lower middle income in 1995 are much lower than in other countries of this income group across the world.
#grouping by income classification of 2013
make_box_plot(mat_mort_regions, '2013_income', 'Years', 'Maternal mortality ratio')
If we group the countries by their income rank at the end of the period, we can see that some countries in higher income groups demonstrate the behavior of lower income groups from the plot based on the groups of 1990. Here are two examples.
row = mat_mort_regions[mat_mort_regions['2013_income'] == 'High income']['1990'].idxmax()
mat_mort_regions.loc[row]
row = mat_mort_regions[mat_mort_regions['2013_income'] == 'Upper middle income']['1990'].idxmax()
mat_mort_regions.loc[row]
Of all countries in the data set, 97 kept their income category over 1990-2013 (or changed it and then returned to it), 69 countries improved their position by moving into the next income group, 11 countries went two ranks up, and 2 countries went one rank down. Angola made the fastest progress, going 3 ranks up from a low income to a high income country, while its maternal mortality ratio fell from a peak of 1600 in 1990 to 290 in 2013.
mat_mort_regions['rank_change_sum'].value_counts()
Plotting average income rank against maternal mortality ratios in 2013, which can be considered a result of development through the whole period, we can see that those countries whose position in the ranking went higher also had lower maternal mortality ratios than countries which had a stable position in a higher rank.
mat_mort_regions.plot.scatter(x='avg_inc_rank', y='2013',
c = 'rank_change_sum', cmap = 'viridis', figsize = (12,6), sharex=False
).set(xlabel = 'Average income rank in 1990-2013',
ylabel = 'Maternal mortality ratio in 2013',
title = 'Average income rank vs Maternal mortality ratio');
The correlation between average income rank and maternal mortality ratio appears to be strong and negative, though the relationship is non-linear and the Pearson correlation coefficient grows in magnitude if a logarithmic scale is used.
mat_mort_regions['avg_inc_rank'].corr(mat_mort_regions['2013'])
mat_mort_regions['avg_inc_rank'].corr(np.log(mat_mort_regions['2013']))
mat_mort_regions['avg_inc_rank'].corr(mat_mort_regions['2013'], method='kendall')
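The effect of the log scale can be illustrated on synthetic data where the ratio falls exponentially with the rank. The numbers are made up; Spearman's coefficient, computed here as Pearson on ranks, stands in for a rank-based measure such as the Kendall coefficient used above.

```python
import numpy as np
import pandas as pd

# Made-up monotone but non-linear relationship
rank = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
ratio = 1000 * np.exp(-1.5 * (rank - 1))

pearson_raw = rank.corr(ratio)
pearson_log = rank.corr(np.log(ratio))
# Pearson on ranks is Spearman's coefficient; it only looks at the ordering
spearman = rank.rank().corr(ratio.rank())

print(round(pearson_raw, 3), round(pearson_log, 3), round(spearman, 3))
```

Pearson measures linear association only, so it understates a perfectly monotone exponential relationship until the ratio is log-transformed, while a rank-based coefficient is unaffected by the transformation.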
From the geographical perspective, the lowest ratios in 1980-2013 were seen in the countries of Europe and the highest in Africa. The greatest decrease in the range of the distribution can be seen in Asia, while in the Americas the changes during this period were less pronounced and, in percentage terms, close to the dynamics in Africa, though in absolute numbers Africa made the greatest progress (see summary statistics below).
# distributions in geographical regions
make_box_plot(mat_mort_regions, 'four_regions', 'Years', 'Maternal mortality ratio')
mat_mort_regions.groupby('four_regions')[['1980', '2013']].describe()
get_change('four_regions', mat_mort_regions, 1980, 2013)
However, four regions give us a rather broad view. If we consider more specific groups, some regional differences can be seen from the charts.
make_box_plot(mat_mort_regions, 'six_regions', 'Years', 'Maternal mortality ratio', (10,15))
make_box_plot(mat_mort_regions, 'eight_regions', 'Years', 'Maternal mortality ratio', (12,15))
The ratios in East European countries were higher than in West European countries, but in both regions the distributions lie closer to 0 than in any other region, with measures of center below 15 in the East and below 10 in the West in 2013 and a decrease of about 70% in the means since 1980.
Both West Asia and East Asia showed a decrease in maternal mortality ratios, going from a mean of about 425 for both regions in 1980 to 95 and 79 respectively in 2013. North Africa and Sub-Saharan Africa show quite different trends in 1980-2013: while in North Africa the distribution moved below 500 after 1980 and below 250 in 2000, with a further decrease afterwards, Sub-Saharan Africa went through an increase in maternal mortality in the 1990s in comparison with 1980, getting back to the 1980 levels in 2000. After that the decrease continued, but the overall scale of the ratios is still incomparable with other regions.
The distributions in North America and South America over the years look similar except for outliers. The ratios look average in comparison with other regions, with a percentage decrease in average ratios of about 45-52%.
mat_mort_regions.groupby('eight_regions').describe()[['1980', '2013']]
change_8 = get_change('eight_regions', mat_mort_regions, 1980, 2013).sort_values(by=['mean'])
change_8
The largest decrease in means can be observed in West and East Asia, followed by North Africa; the largest decrease in medians is in North Africa, followed by Asia.
plot = change_8.plot.bar()
plot.set(xlabel = 'Regions', ylabel = 'Change');
We can also combine the region and income variables to see if there are any differences in trends for countries of the same income group but located in different regions and vice versa.
#medians in regional income groups
# numeric_only skips the text and categorical columns when computing medians
df_income_eight = mat_mort_regions.iloc[:, 0:17].groupby(['2013_income', 'eight_regions']).median(numeric_only=True)
df_income_eight['n_countries'] = mat_mort_regions.groupby(['2013_income',
'eight_regions']).count()['country']
df_income_eight
From the table above we can see some examples of such comparisons for the income groups of 2013. Thus the high income countries in North America tend to have higher median maternal mortality ratios than the upper middle income countries of West Europe. Sub-Saharan Africa's pattern of high ratios can be seen in all income groups, except that in the upper middle income group the median didn't go up in 1990, as can be seen on the following plots.
income_groups = list(mat_mort_regions['2013_income'].cat.categories)
for group in income_groups:
    # medians of the year columns per region within one income group
    group_regional_medians = (mat_mort_regions[mat_mort_regions['2013_income'] == group]
                              .iloc[:, 0:17].groupby('eight_regions').median(numeric_only=True))
    plot = group_regional_medians.T.plot.line(figsize = (8, 6), title = group + " in 2013, medians")
    plot.set(xlabel = 'Years', ylabel = 'Maternal mortality ratio')
    plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
Since economy and geography can only be considered to influence maternal mortality indirectly, we can explore other variables that may have a more direct impact on maternal mortality ratios in different countries. Some of them are demographic parameters, like total fertility rate or median age; some relate to healthcare economics, like the government share of health spending or total health spending; and some to the healthcare system, like the number of births attended by skilled health staff.
The information was obtained on Gapminder.com in the following sections:
#Loading and preparing data to follow the time period of the maternal mortality data, cleaning missing data
attended_births = pd.read_csv('births_attended_by_skilled_health_staff_percent_of_total.csv')
attended_births = attended_births.replace(0, np.nan)
attended_births.info()
#no data before 1984
attended_births[['country', '1990', '1995', '2000', '2005', '2010']].info()
#little data for 1990 and 1995
total_health_spending = pd.read_csv('total_health_spending_per_person_us.csv')
total_health_spending = total_health_spending.replace(0, np.nan)
total_health_spending.info()
# no data before 1995
gov_share_health_spending = pd.read_csv('government_share_of_total_health_spending_percent.csv')
gov_share_health_spending = gov_share_health_spending.replace(0, np.nan)
gov_share_health_spending.info()
#no data before 1995
total_fert = pd.read_csv('children_per_woman_total_fertility.csv')
total_fert = total_fert.replace(0, np.nan)
total_fert[['country', '1990', '1995', '2000', '2005', '2010']].info()
median_age = pd.read_csv('median_age_years.csv')
median_age = median_age.replace(0, np.nan)
median_age[['country', '1990', '1995', '2000', '2005', '2010']].info()
Since for most countries the data for all variables in question are available only for several recent years, three dataframes were created as time slices for 2000, 2005 and 2010, which makes it possible not only to check for correlations, but also to estimate the dynamics over time in recent decades.
#Building dataframes
pd.options.mode.chained_assignment = None
def create_df_for_year(year, mat_mort_df, att_births_df, health_spend_df, gov_share_df, total_fert_df, med_age_df):
    """
    Builds a dataframe for one year: maternal mortality plus the additional indicators, merged on country
    """
    new_df = mat_mort_df[['country', 'eight_regions', '1995_income', year]]
    new_df = new_df.rename(columns={year: 'mat_mort'})
    new_df = pd.merge(new_df, att_births_df[['country', year]], how = 'left', on='country')
    new_df = new_df.rename(columns={year: 'att_births'})
    new_df = pd.merge(new_df, health_spend_df[['country', year]], how = 'left', on='country')
    new_df = new_df.rename(columns={year: 'health_spend'})
    new_df = pd.merge(new_df, gov_share_df[['country', year]], how = 'left', on='country')
    new_df = new_df.rename(columns={year: 'gov_share'})
    new_df = pd.merge(new_df, total_fert_df[['country', year]], how = 'left', on='country')
    new_df = new_df.rename(columns={year: 'total_fert'})
    new_df = pd.merge(new_df, med_age_df[['country', year]], how = 'left', on='country')
    new_df = new_df.rename(columns={year: 'median_age'})
    return new_df
df_2000 = create_df_for_year('2000', mat_mort_regions, attended_births,
total_health_spending, gov_share_health_spending, total_fert, median_age)
df_2000.describe()
df_2005 = create_df_for_year('2005', mat_mort_regions, attended_births,
total_health_spending, gov_share_health_spending, total_fert, median_age)
df_2005.describe()
df_2010 = create_df_for_year('2010', mat_mort_regions, attended_births,
total_health_spending, gov_share_health_spending, total_fert, median_age)
df_2010.describe()
df_2010.describe() - df_2000.describe()
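The merge-and-rename pattern used to build the yearly slices could also be written as a loop over a mapping of indicator names to dataframes. The sketch below is a hypothetical variant on made-up inputs, not the function used in this project.

```python
import pandas as pd

def build_year_slice(year, base_df, indicators):
    """Hypothetical variant of create_df_for_year taking a name -> dataframe mapping."""
    result = base_df[['country', year]].rename(columns={year: 'mat_mort'})
    for name, frame in indicators.items():
        # keep only the requested year and give it the indicator's name
        column = frame[['country', year]].rename(columns={year: name})
        # a left join keeps every country of the base dataframe
        result = result.merge(column, how='left', on='country')
    return result

# Made-up inputs
base = pd.DataFrame({'country': ['A', 'B'], '2000': [10.0, 500.0]})
fert = pd.DataFrame({'country': ['A', 'B'], '2000': [1.5, 6.0]})
age = pd.DataFrame({'country': ['A'], '2000': [40.0]})

slice_2000 = build_year_slice('2000', base, {'total_fert': fert, 'median_age': age})
print(slice_2000)
```

Country "B" has no median age in the toy input, so the left join leaves a NaN there instead of dropping the row.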
As we can see from the summary statistics, both the mean and the median of total fertility rate decreased in 2000-2010 together with maternal mortality, while the indices of health economy and median age grew. We can now use scatter plots to explore the relations between maternal mortality and the new variables.
Number of births attended by skilled health staff
df_list = [(df_2000, '2000'), (df_2005, '2005'), (df_2010, '2010')]
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='att_births',
                              title = 'Maternal mortality vs Birth attended by skilled staff' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Attended births, %')
#correlation coefficients
for df in df_list:
    print(df[1], df[0]['mat_mort'].corr(df[0]['att_births']))
Total health spending
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='health_spend',
                              title = 'Maternal mortality vs Total health spending' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Health spending, USD')
#Changing the scale to logarithmic to check for linearity
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='health_spend',
                              title = 'Maternal mortality vs Total health spending' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Health spending, USD')
    plot.set_yscale('log')
#correlation coefficients
for df in df_list:
    print(df[1], df[0]['mat_mort'].corr(np.log(df[0]['health_spend'])))
Government share of total health spending
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='gov_share',
                              title = 'Maternal mortality vs Government share of health spending' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Government share, %')
#correlation coefficients
for df in df_list:
    print(df[1], df[0]['mat_mort'].corr(df[0]['gov_share']))
Total fertility rate (babies per woman)
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='total_fert',
                              title = 'Maternal mortality vs Total fertility rate' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Total fertility rate')
#correlation coefficients
for df in df_list:
    print(df[1], df[0]['mat_mort'].corr(df[0]['total_fert']))
Median age of the population
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='median_age',
                              title = 'Maternal mortality vs Median Age' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Median age, years')
#Changing the scale to logarithmic to check for linearity
for df in df_list:
    plot = df[0].plot.scatter(x='mat_mort', y='median_age',
                              title = 'Maternal mortality vs Median Age' + " in " + df[1])
    plot.set(xlabel = 'Maternal mortality', ylabel = 'Median age, years')
    plot.set_xscale('log');
#correlation coefficients
for df in df_list:
    print(df[1], df[0]['mat_mort'].corr(df[0]['median_age'], method='kendall'))
As can be seen from the charts and correlation coefficients above, there is a strong positive correlation between maternal mortality and total fertility rate (number of children per woman), and also a strong negative correlation between maternal mortality and number of births attended by skilled health staff. There is also negative correlation that tends to grow over time between maternal mortality and median age of the country population, though from scatter plot the relation appears to be non-linear. The non-linear relationship can also be seen between maternal mortality and total health spendings. The government share of health spending also shows moderate negative correlation with maternal mortality.
The significance of these relations remains to be estimated. Possible interrelations between the independent variables should also be considered. For example, countries where the median age of the population is higher are typically also developed countries that have already gone through the second demographic transition, and thus have a lower total fertility rate and higher health spending in absolute terms. The following correlation coefficients give some support to this statement.
# Interrelations between independent variables
# median age vs total health spending
for df in df_list:
    print(df[1], df[0]['median_age'].corr(df[0]['health_spend'], method='kendall'))
# median age vs government share of health spending
for df in df_list:
    print(df[1], df[0]['median_age'].corr(df[0]['gov_share']))
# total fertility rate vs median age
for df in df_list:
    print(df[1], df[0]['total_fert'].corr(df[0]['median_age']))
# total health spending vs number of births attended by skilled health staff
for df in df_list:
    print(df[1], df[0]['health_spend'].corr(df[0]['att_births'], method='kendall'))
# government share vs number of births attended by skilled health staff
for df in df_list:
    print(df[1], df[0]['gov_share'].corr(df[0]['att_births']))
Thus, the median age of the population has a rather strong correlation with total health spending, especially in recent years, and a moderate correlation with the government share of total health spending, while its correlation with total fertility rate is strong and negative. The number of births attended by skilled health staff also has a moderate positive correlation with both total health spending and its government share.
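The pairwise checks above could also be collected into a single correlation matrix. The sketch below uses made-up values for five hypothetical countries; only the column names are borrowed from the data frames used earlier. Kendall's coefficient is applied to every pair for simplicity, which is an assumption and differs from the mix of Pearson and Kendall used above.

```python
import pandas as pd

# Made-up values for five hypothetical countries, for illustration only
toy = pd.DataFrame({
    'mat_mort':   [12, 45, 150, 400, 800],
    'total_fert': [1.5, 2.1, 3.0, 4.8, 6.2],
    'median_age': [41, 33, 27, 20, 17],
    'att_births': [99, 97, 80, 55, 40],
})

# One call returns every pairwise coefficient, including those between
# the explanatory variables themselves
corr_matrix = toy.corr(method='kendall')
print(corr_matrix.round(2))
```

Applied to each `df[0]` in `df_list`, the same call would show at a glance which explanatory variables move together and may therefore be hard to separate in a later model.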
The exploratory analysis of the data set has shown that maternal mortality in the world decreased significantly during 1980-2013. The distribution of observations, which remained positively skewed throughout, became narrower and more concentrated below 100 cases per 100,000 live births. The world mean mortality ratio dropped from 362.3 in 1980 to 162.8 in 2013 (by 55.1%), and the median decreased from 169.5 to 64 (by 62.2%). The mean decreased gradually over the whole period, while the median dropped most sharply in 1980-1990.
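The percentage changes quoted for the mean and the median can be checked with a short calculation. The helper name `percent_decrease` is introduced here for illustration only; the input figures are the ones stated in the text.

```python
def percent_decrease(start, end):
    """Relative decrease from start to end, in percent."""
    return (start - end) / start * 100

mean_drop = percent_decrease(362.3, 162.8)    # world mean, 1980 -> 2013
median_drop = percent_decrease(169.5, 64)     # world median, 1980 -> 2013

print(round(mean_drop, 1), round(median_drop, 1))  # 55.1 62.2
```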
The share of countries with maternal mortality ratios below the world average remained at 67-68% in 1990-2013, and was only a few percentage points lower in 1980 (63%). Meanwhile, the share of countries with a maternal mortality ratio below 100 increased from 37.8% in 1980 to 60% in 2013.
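A share well above 50% of countries below the average is what one would expect from a positively skewed distribution, where a few very high values pull the mean up. A minimal sketch of how such shares can be computed follows; the ratios are made up and the helper `share_below` is hypothetical, not part of the original notebook.

```python
import pandas as pd

def share_below(values, threshold):
    """Percentage of non-missing observations strictly below the threshold."""
    values = pd.Series(values).dropna()
    return (values < threshold).mean() * 100

# Made-up, positively skewed ratios for eight hypothetical countries
ratios = [10, 20, 30, 50, 80, 120, 400, 900]
mean_ratio = sum(ratios) / len(ratios)          # 201.25, pulled up by the two largest values

below_mean = share_below(ratios, mean_ratio)    # 6 of 8 countries
below_100 = share_below(ratios, 100)            # 5 of 8 countries
print(below_mean, below_100)  # 75.0 62.5
```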
West Europe is the geographical leader in terms of the lowest ratios, followed by East Europe, while Sub-Saharan Africa still has the highest values, though its mean and median decreased by 38% and 35% respectively during the period in question. Significant improvements can be seen in the Asian regions (by about 80% for the means) and North Africa (by 78%).
Combining the economic and geographical characteristics, we can conclude that geography may sometimes be the prevailing factor within the same income group. Thus the ratios in upper middle income East European countries are usually lower than those of high income countries in both Americas, while upper middle income countries in Africa and South America still have much higher maternal mortality, though with a great decrease since 1980. Differences in the healthcare systems of the countries and regions may have an impact here. In general, neither regional nor income characteristics can be interpreted on their own, but only as a reflection of the combination of demographic, social and economic parameters that developed in specific country groups over time.
Among other factors that may have a more direct influence on maternal mortality, the following were explored: the number of births attended by skilled health staff, total fertility rate (number of babies per woman), median age of the population, total health spending in US dollars, and the government share of health spending as a percentage. The EDA was limited to three years (2000, 2005 and 2010) because of missing data.
A strong positive correlation can be seen between maternal mortality and total fertility rate, and a strong negative correlation between maternal mortality and the number of births attended by skilled health staff. There is also a negative correlation between maternal mortality and the median age of the population that strengthens over time, though the scatter plots suggest the relation is non-linear. A non-linear negative relationship can also be seen between maternal mortality and total health spending. The significance of the correlation coefficients remains to be estimated. The interrelations between the independent variables should also be taken into account in further analysis and modelling.
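As a possible starting point for estimating significance, SciPy's correlation functions return a p-value together with the coefficient. The sketch below is one way this could be done and was not part of the analysis above; the helper `corr_with_pvalue` and the toy values are hypothetical, and only the column names follow the data frames used earlier.

```python
import pandas as pd
from scipy import stats

def corr_with_pvalue(frame, x, y, method='kendall'):
    """Return (coefficient, p-value) for two columns, ignoring rows with gaps."""
    pair = frame[[x, y]].dropna()
    if method == 'kendall':
        coef, p_value = stats.kendalltau(pair[x], pair[y])
    else:
        coef, p_value = stats.pearsonr(pair[x], pair[y])
    return coef, p_value

# Made-up values for ten hypothetical countries, for illustration only
toy = pd.DataFrame({
    'mat_mort':   [10, 25, 40, 60, 90, 150, 250, 400, 600, 900],
    'median_age': [42, 40, 37, 33, 30, 27, 24, 21, 19, 17],
})

tau, p = corr_with_pvalue(toy, 'mat_mort', 'median_age')
print(round(tau, 2), p < 0.05)
```

With three separate years and several variable pairs, many such tests would be run at once, so some correction for multiple comparisons would probably be needed before drawing conclusions.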