1 Introduction
Mental disorders are a widespread phenomenon: approximately 1 in 5 adults in the US (21%, 52.9 million people) experienced mental illness in 2020^1, and up to 1 billion people across the globe suffer from a mental disorder. Lost productivity due to anxiety and depression, the two most common conditions, alone costs 1 trillion US dollars annually worldwide^2. Still, not enough attention is paid to mental health and not enough healthcare resources are assigned to improve the situation. As one response in the US, the non-profit corporation Open Sourcing Mental Health was founded in 2013. Its purpose is to raise awareness, educate, and provide resources to support mental wellness in the tech and open source communities^3. Since 2014, the corporation has organized surveys to investigate the status of mental health, attitudes towards mental health, and the frequency of mental health disorders in the tech industry. This project is dedicated to the analysis of these surveys’ data, acquired in 2014-2019.
2 Overview and Inspection
In this project, the “Mental Health in the Tech Industry” database from Kaggle, which covers surveys conducted in 2014-2019, was used. The database consists of 3 tables: ‘Answer’, ‘Question’, and ‘Survey’:
Table ‘Survey’ contains the names of the surveys performed in different years. The survey ID column indicates the year of investigation. The data span 2014 to 2019 (excluding 2015).
Table ‘Question’ contains the 105 unique questions used across all the surveys. Question IDs range from 1 to 118; IDs 35 to 47 are missing. Twelve questions are common to all surveys, some are used in just a single survey, but most are shared by a subset of surveys.
Table ‘Answer’ contains the answers to the questions of each survey in a long-format table (one row per combination of year, question, and respondent). Survey IDs range from 2014 to 2019 with 5 unique values, user IDs from 1 to 4218 with 4218 unique responses (filled survey questionnaires)^4, and question IDs from 1 to 118 with 105 unique values, so these numbers match the data in the other tables.
In the answer text field of this table, there are some strange values, e.g., “¯\(°_o)/¯”, that might need further investigation if those questions are included in the analysis.
There are no NaN/Null values:
Because of the long table format, structural missing values are absent; they would appear only if there were one record per user or per user-survey combination.
Unanswered/skipped questions are either not included at all or coded differently (e.g., as -1; we will see this in a later section, and a quick check follows below).
Find the details in the subsections below.
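For instance, one quick way to see how skipped answers are coded (a minimal sketch, assuming the db_engine connection created in subsection 2.2) is to count the answers equal to -1 per question:
Code
# Count potential missing-value codes (-1) per question (sketch)
pd.read_sql(
    """--sql
    SELECT QuestionID, COUNT(*) AS n_minus_one
    FROM Answer
    WHERE AnswerText = -1
    GROUP BY QuestionID
    ORDER BY n_minus_one DESC
    LIMIT 10;
    """,
    db_engine,
)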
2.1 Setup
Code
# Packages and modules -------------------------------
# Typing
from typing import Union

# To connect SQL database
import sqlalchemy as sqa

# Data wrangling
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Missing value identification
import missingno as msno

# Custom functions
import functions as my

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 100)

# Colors
green, blue, orange, red = "tab:green", "tab:blue", "tab:orange", "tab:red"
Functions that are generalizable and may be used in several projects are kept in a separate file, functions.py. Ad hoc functions specific to this dataset are defined in this document:
Code
# -- Custom ad-hoc functions

# Convert to categorical
def to_category(x: pd.Series, option: str) -> pd.Series:
    """Convert a variable to a categorical one based on the provided option.

    Args:
        x (pd.Series): Original variable.
        option (str): Option to determine the categories for conversion.
            Possible options are:
            - 'yes_no': Convert to categories ['Yes', 'No']
            - 'yes_maybe_no': Convert to categories ['Yes', 'Maybe', 'No']
            - 'yes_no_dk': Convert to categories ['Yes', 'No', "Don't know"]
            - 'yes_no_ne_dk': Convert to categories
              ['Yes', 'No', 'Not eligible for coverage', "Don't know"]

    Returns:
        pd.Series: Variable with the new data type as a categorical variable.

    Raises:
        ValueError: If an invalid option is provided.
    """
    if option == "yes_no":
        categories = ["Yes", "No"]
    elif option == "yes_maybe_no":
        categories = ["Yes", "Maybe", "No"]
    elif option == "yes_no_dk":
        categories = ["Yes", "No", "Don't know"]
    elif option == "yes_no_ne_dk":
        categories = ["Yes", "No", "Not eligible for coverage", "Don't know"]
    else:
        raise ValueError(f"Invalid option: {option}")
    return pd.Categorical(x, categories=categories)


# Get question text from 'Question' table by ID
def print_questions(id: Union[int, float, tuple[int, ...]]) -> None:
    """Show question text by ID.

    Args:
        id (integer or tuple of integers): Question IDs in 'Question' table.
    """
    if type(id) in [int, float]:
        id = f"({int(id)})"
        s = ""
    else:
        s = "s"
    questions = pd.read_sql(
        f"""--sql
        SELECT questionid ID, questiontext 'Question'
        FROM Question
        WHERE QuestionID IN {id};
        """,
        db_engine,
    )
    print(f"Question{s}:")
    display(questions)
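A minimal usage example of to_category() (the values below are illustrative, not taken from the dataset):
Code
# Illustrative values only
example = pd.Series(["Yes", "No", None, "Yes"])
to_category(example, "yes_no")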
2.2 Inspection of Tables in Database
Code
# Create a connection to the database
db_engine = sqa.create_engine("sqlite:///data/mental_health.sqlite")
Code
# List tables
sqa.inspect(db_engine).get_table_names()
['Answer', 'Question', 'Survey']
Let’s look at each table separately.
Table ‘Survey’ is small and self-explanatory:
Code
pd.read_sql("SELECT COUNT(*) n_rows FROM Survey;", db_engine)
   n_rows
0       5
Code
ds_survey = pd.read_sql("SELECT * FROM Survey;", db_engine)
ds_survey
   SurveyID                    Description
0      2014  mental health survey for 2014
1      2016  mental health survey for 2016
2      2017  mental health survey for 2017
3      2018  mental health survey for 2018
4      2019  mental health survey for 2019
Table ‘Question’ contains the text of all questions:
Code
ds_questions = pd.read_sql("SELECT * FROM Question;", db_engine)
ds_questions.head()
                                                                  questiontext  questionid
0                                                            What is your age?           1
1                                                         What is your gender?           2
2                                                 What country do you live in?           3
3  If you live in the United States, which state or territory do you live in?           4
In this subsection, we will try to answer the following questions:
Is the number of questions the same in each survey?
Are any questions common to all surveys? If so, which ones?
Code
incl_questions = pd.read_sql(
    """--sql
    SELECT DISTINCT SurveyID year, QuestionID, 'True' response
    FROM Answer;
    """,
    db_engine,
)

incl_questions_by_year = incl_questions.pivot_table(
    index="QuestionID",
    columns="year",
    values="response",
    aggfunc="first",
)

# Preview reshaped dataset
print("Bottom rows of the reshaped dataset:\n")
print(incl_questions_by_year.tail())

# Plot missing values
ax = msno.matrix(incl_questions_by_year, sparkline=False, figsize=(8, 4))
ax.set_xlabel("Year")
ax.set_ylabel("Index of Row with Question")
ax.set_title("Questions Included in the Surveys \nby Year", fontsize=20);
Bottom rows of the reshaped dataset:
year 2014 2016 2017 2018 2019
QuestionID
114 NaN True NaN NaN NaN
115 NaN True NaN NaN NaN
116 NaN True NaN NaN NaN
117 NaN True NaN NaN NaN
118 NaN True NaN NaN NaN
Each row represents a unique question: a dark square in the plot means the question is included in that year's survey, while a blank rectangle means it is absent from that survey.
It seems that the 2017, 2018, and 2019 surveys use the same questionnaire, and around half of their questions overlap with the questions of the 2016 survey.
There are 12 questions common to all surveys, and 31 more questions that are common to the 2016-2019 surveys.
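These counts can be spot-checked directly from the reshaped dataset; a minimal sketch counting the questions included per year:
Code
# Number of questions included in each survey (non-missing cells per year)
incl_questions_by_year.notna().sum()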
Code
# -- IDs of questions common for all the surveys
def get_all_true_rows(data):
    is_all_true = (~data.isnull()).all(axis=1)
    return [i for i in data[is_all_true].index]


common_question_id = get_all_true_rows(incl_questions_by_year)

print(f"N common questions for all the surveys: {len(common_question_id)}")
print("IDs of these questions:")
common_question_id
N common questions for all the surveys: 12
IDs of these questions:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Additionally, the questions common to the 2016-2019 surveys only (excluding the questions listed above):
Code
# -- IDs of questions common just for 2016-2019 surveys
common_question_id_2 = get_all_true_rows(
    incl_questions_by_year.drop(2014, axis=1)
)
q_id = list(set(common_question_id_2) - set(common_question_id))

print(f"The number of common questions just to the 2016-2019 surveys: {len(q_id)}")
print("IDs of these questions:")
print(q_id[:20])
print(q_id[20:])
The number of common questions just to the 2016-2019 surveys: 31
IDs of these questions:
[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
[33, 34, 48, 49, 50, 51, 52, 53, 54, 55, 56]
If needed, you can find these questions listed in Supplement B
(file supplement-b--common-questions.ipynb)
3 Pre-Processing and Exploratory Data Analysis
Due to the large number of questions, it was decided to concentrate mainly on the subset of questions that are common either to all surveys or to all surveys except the first one (2014). More concretely, these topics will be the focus of this analysis:
Respondents’ activity (participation) in each survey.
The profile of a typical respondent.
Mental health and productivity.
Factors related to mental health disorder (MHD) treatment seeking.
Before delving into the analysis, the data must be prepared, so let’s begin with pre-processing.
NOTE: In my opinion, sections “Mental Health and Productivity” and “Factors Related to MHD Treatment Seeking” are the most important, so spend enough time exploring these.
3.1 Pre-Processing
Most of the data pre-processing steps are presented in this section; a few appear in later analysis subsections where that was more convenient.
Frequent pre-processing steps:
-1 represents a missing value, so it is recoded as such.
In “Yes/No” questions, 0 is recoded as “No” and 1 as “Yes” (a generic sketch of this pattern follows this list).
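As a generic illustration of this recoding pattern (a sketch only; question ID 5, self-employment, is used as the example, and db_engine comes from the Setup subsection):
Code
# Generic recoding sketch: -1 -> missing, 0/1 -> "No"/"Yes"
pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS recoded_answer
    FROM Answer
    WHERE QuestionID = 5;  -- example question ID
    """,
    db_engine,
).head()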
Age. The dataset contains negative as well as extremely large (>300) age values. It was decided to include only values between 18 and 80 years:
Code
# Age: all values
v_age = pd.read_sql(
    """--sql
    SELECT UserID user_id, CAST(AnswerText AS INT) AS age
    FROM Answer
    WHERE QuestionID = 1;
    """,
    db_engine,
)
v_age.drop("user_id", axis=1).describe().T
      count       mean        std   min   25%   50%   75%    max
age  4218.0  33.915363  10.478054 -29.0  28.0  33.0  38.0  329.0
Code
# -- Extreme values of age
print("Extreme values:")
pd.read_sql(
    """--sql
    WITH q1_int AS (
        SELECT CAST(AnswerText AS INT) age_value
        FROM Answer
        WHERE QuestionID = 1
    )
    SELECT age_value, COUNT(age_value) count
    FROM q1_int
    WHERE age_value NOT BETWEEN 20 AND 65
    GROUP BY age_value
    ORDER BY age_value;
    """,
    db_engine,
).T
Extreme values:

            0   1  2  3  4  5   6   7   8   9  10  11  12  13  14  15  16   17   18
age_value -29  -1  0  3  5  8  11  15  17  18  19  66  67  70  72  74  99  323  329
count       1   5  1  1  1  1   1   1   1   9  20   2   2   1   1   1   1    1    1
Code
plt.figure(figsize=(8, 3))
sns.boxplot(x="age", data=v_age).set(title="Age Distribution (Raw Data)")
[Figure: boxplot “Age Distribution (Raw Data)”]
Code
# -- Filter out age of interest (18-80 y.o.)
q_age = pd.read_sql(
    """--sql
    WITH q1_int AS (
        SELECT UserID user_id, CAST(AnswerText AS INT) age
        FROM Answer
        WHERE QuestionID = 1
    )
    SELECT
        user_id,
        -- Age
        CASE WHEN age BETWEEN 18 AND 80 THEN age ELSE NULL END AS age,
        -- Age group: every ~5 years
        CASE
            WHEN age >= 18 AND age < 26 THEN "18-25"
            WHEN age >= 26 AND age < 31 THEN "26-30"
            WHEN age >= 31 AND age < 36 THEN "31-35"
            WHEN age >= 36 AND age < 41 THEN "36-40"
            WHEN age >= 41 AND age < 46 THEN "41-45"
            WHEN age >= 46 AND age < 51 THEN "46-50"
            WHEN age >= 51 AND age < 56 THEN "51-55"
            WHEN age >= 56 AND age < 61 THEN "56-60"
            WHEN age >= 61 AND age < 66 THEN "61-65"
            WHEN age >= 66 AND age < 71 THEN "66-70"
            WHEN age >= 71 AND age < 76 THEN "71-75"
            ELSE NULL
        END AS age_group,
        -- Age group: 3 categories
        CASE
            WHEN age >= 18 AND age < 36 THEN "18-35"
            WHEN age >= 36 AND age < 76 THEN "36-75"
            ELSE NULL
        END AS age_group_2
    FROM q1_int
    """,
    db_engine,
)
Gender. The dataset contained many different values describing gender. During pre-processing, all values were lowercased; the words “female” and “femmina” were treated as “Female”, and “male”, “cishet male”, “masculine”, and “masculino” as “Male”. All remaining values were assigned to the category “Other/Unknown”.
Code
# Gender
q_gender = pd.read_sql(
    """--sql
    WITH gender_lower AS (
        SELECT UserID user_id, LOWER(AnswerText) gender_answer, SurveyID year
        FROM Answer
        WHERE QuestionID = 2
    )
    SELECT
        user_id,
        year,
        -- Pre-process gender values.
        -- If needed, use original values from "gender_answer" for checking
        (CASE
            -- female options
            WHEN gender_answer LIKE "female"
                OR gender_answer LIKE "femmina"
            THEN "Female"
            -- male options
            WHEN gender_answer LIKE "male"
                OR gender_answer LIKE "masculine"
                OR gender_answer LIKE "masculino"
                OR gender_answer LIKE "cishet male"
            THEN "Male"
            -- other
            ELSE "Other/Unknown"
        END) AS gender
    FROM gender_lower;
    """,
    db_engine,
)
Country of residence and region. Country names were recoded in the following way:
‘United States’ and ‘United States of America’ were merged into ‘USA’
‘United Kingdom’ shortened to ‘UK’
Values -1 were treated as missing values.
Additional variables were created too:
country_major_ctb (countries that are major contributors to a survey): in this variable, the least frequently occurring countries (fewer than 50 respondents) were merged into “Other”
region: a variable to group countries into regions like North America, Asia or Europe.
Code
# -- Country of residence
q_country_live = pd.read_sql(
    """--sql
    -- Use unified or shorter country names
    WITH countries_recoded AS (
        SELECT
            UserID user_id,
            CASE
                WHEN AnswerText IN ('United States', 'United States of America')
                    THEN 'USA'
                WHEN AnswerText IN ('United Kingdom') THEN 'UK'
                WHEN AnswerText IN (-1, 'Unknown') OR AnswerText IS NULL
                    THEN NULL
                ELSE AnswerText
            END AS country
        FROM Answer
        WHERE QuestionID = 3
    ),
    -- Filter out countries which contributed least
    major_contributor AS (
        SELECT country, COUNT(country) count
        FROM countries_recoded
        GROUP BY country
        HAVING COUNT(country) >= 50
        ORDER BY count DESC
    )
    SELECT
        user_id,
        country,
        -- Only most contributing countries
        CASE
            WHEN country IN (SELECT country FROM major_contributor) THEN country
            WHEN country IN (-1, 'Unknown') OR country IS NULL THEN NULL
            ELSE 'Other'
        END AS country_major_ctb,
        -- Region
        CASE
            WHEN country IN (
                "Algeria", "Ethiopia", "Ghana", "Nigeria", "Kenya",
                "South Africa", "Mauritius", "Zimbabwe")
                THEN 'Africa'
            WHEN country IN ("Australia", "New Zealand") THEN 'Australasia'
            WHEN country IN (
                "China", "Georgia", "India", "Israel", "Japan", "Philippines",
                "Singapore", "Thailand", "Pakistan", "Turkey", "Iran",
                "Afghanistan", "Indonesia", "Bangladesh", "Hong Kong",
                "Saudi Arabia", "Jordan", "Brunei", "Taiwan", "Vietnam")
                THEN 'Asia'
            WHEN country IN (
                "Austria", "Belgium", "Bosnia and Herzegovina", "Bulgaria",
                "Croatia", "Czech Republic", "Denmark", "Finland", "France",
                "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia",
                "Lithuania", "Moldova", "Serbia", "Ukraine", "Belarus",
                "Netherlands", "Slovakia", "Estonia", "Iceland", "Macedonia",
                "Norway", "Poland", "Portugal", "Romania", "Slovenia", "Spain",
                "Sweden", "Switzerland", "UK",  -- Europe
                "Russia")  -- Both Asia and Europe
                THEN 'Europe'
            WHEN country IN ("USA", "Canada", "Bahamas, The")
                THEN 'North America'
            WHEN country IN (
                "Mexico", "Brazil", "Costa Rica", "Colombia", "Uruguay",
                "Guatemala", "Ecuador", "Argentina", "Chile", "Venezuela")
                THEN 'Latin America'
            WHEN country IN ('Other', 'Unknown') THEN NULL
            ELSE country
        END AS region
    FROM countries_recoded;
    """,
    db_engine,
)
Other variables of interest were also prepared for analysis. In some cases, similar values were merged into a single category. But usually, values were recoded into a human-readable form (like “Yes”, “No”, etc.) and converted to categorical variables to keep the order of categories constant. Some values were prepended with letters (A, B, etc.) so that alphabetical sorting yields the desired order.
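The letter prefixes are a simple trick to force the desired sort order; an alternative approach, shown here only as a sketch and not used in this report, is an ordered pandas categorical:
Code
# Sketch: ordered categorical instead of letter prefixes
sizes = pd.Series(["6-25", "1-5", "More than 1000"])  # illustrative values
pd.Categorical(
    sizes,
    categories=["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"],
    ordered=True,
)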
Code
# -- Mental health
q_have_mhd = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            -- "Maybe", "Possibly", "Don't Know" look like the same state
            -- of uncertainty about diagnosis
            WHEN AnswerText IN ("Maybe", "Possibly", "Don't Know")
                THEN "Don't know"
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS have_mhd
    FROM Answer
    WHERE QuestionID = 33;
    """,
    db_engine,
)
q_have_mhd.have_mhd = to_category(q_have_mhd.have_mhd, "yes_no_dk")

q_ever_diagnosed_mhd = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS ever_diagnosed_mhd
    FROM Answer
    WHERE QuestionID = 34;
    """,
    db_engine,
)
q_ever_diagnosed_mhd.ever_diagnosed_mhd = to_category(
    q_ever_diagnosed_mhd.ever_diagnosed_mhd, "yes_no"
)

q_sought_professional_treatment = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS ever_sought_prof_mhd_treatment
    FROM Answer
    WHERE QuestionID = 7;
    """,
    db_engine,
)
q_sought_professional_treatment.ever_sought_prof_mhd_treatment = to_category(
    q_sought_professional_treatment.ever_sought_prof_mhd_treatment, "yes_no"
)
Code
# -- Company related
q_self_employed = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS self_employed
    FROM Answer
    WHERE QuestionID = 5;
    """,
    db_engine,
)
q_self_employed.self_employed = to_category(
    q_self_employed.self_employed, "yes_no"
)

q_company_size = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        -- Company size: 6 categories
        CASE
            WHEN AnswerText = '1-5' THEN 'A. 1-5'
            WHEN AnswerText = '6-25' THEN 'B. 6-25'
            WHEN AnswerText = '26-100' THEN 'C. 26-100'
            WHEN AnswerText = '100-500' THEN 'D. 100-500'
            WHEN AnswerText = '500-1000' THEN 'E. 500-1000'
            WHEN AnswerText = 'More than 1000' THEN 'F. More than 1000'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS company_size,
        -- Company size: 3 categories
        CASE
            WHEN AnswerText IN ('1-5', '6-25', '26-100') THEN 'A. 1-100'
            WHEN AnswerText IN ('100-500', '500-1000') THEN 'B. 100-1000'
            WHEN AnswerText = 'More than 1000' THEN 'C. More than 1000'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS company_size_3
    FROM Answer
    WHERE QuestionID = 8;
    """,
    db_engine,
)

q_is_tech_org = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS is_tech_org
    FROM Answer
    WHERE QuestionID = 9;
    """,
    db_engine,
)
q_is_tech_org.is_tech_org = to_category(q_is_tech_org.is_tech_org, "yes_no")

q_is_tech_it_role = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS is_tech_it_role
    FROM Answer
    WHERE QuestionID = 13;
    """,
    db_engine,
)
q_is_tech_it_role.is_tech_it_role = to_category(
    q_is_tech_it_role.is_tech_it_role, "yes_no"
)
Code
# -- Company mental healthcare resources
q_employer_teach = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText IN ("I don't know", "Don't know") THEN "Don't know"
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS employer_teach
    FROM Answer
    WHERE QuestionID = 16;
    """,
    db_engine,
)
q_employer_teach.employer_teach = to_category(
    q_employer_teach.employer_teach, "yes_no_dk"
)

q_employer_mental_healthcare = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText IN ("I don't know", "Don't know") THEN "Don't know"
            WHEN AnswerText IN ("Not eligible for coverage / NA")
                THEN "Not eligible for coverage"
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS employer_mental_healthcare
    FROM Answer
    WHERE QuestionID = 10;
    """,
    db_engine,
)
q_employer_mental_healthcare.employer_mental_healthcare = to_category(
    q_employer_mental_healthcare.employer_mental_healthcare, "yes_no_ne_dk"
)

q_anonym_protection_to_treat = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText IN ("I don't know", "Don't know") THEN "Don't know"
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS anonym_protection_to_treat
    FROM Answer
    WHERE QuestionID = 11;
    """,
    db_engine,
)
q_anonym_protection_to_treat.anonym_protection_to_treat = to_category(
    q_anonym_protection_to_treat.anonym_protection_to_treat, "yes_no_dk"
)
Code
# -- Atmosphere for openness
q_disclose_interview = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE WHEN AnswerText = -1 THEN NULL ELSE AnswerText
        END AS disclose_in_interview
    FROM Answer
    WHERE QuestionID = 12;
    """,
    db_engine,
)
q_disclose_interview.disclose_in_interview = to_category(
    q_disclose_interview.disclose_in_interview, "yes_maybe_no"
)

q_discus_with_coworkers = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE WHEN AnswerText = -1 THEN NULL ELSE AnswerText
        END AS discus_with_coworkers
    FROM Answer
    WHERE QuestionID = 18;
    """,
    db_engine,
)
q_discus_with_coworkers.discus_with_coworkers = to_category(
    q_discus_with_coworkers.discus_with_coworkers, "yes_maybe_no"
)

q_discus_with_supervisor = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE WHEN AnswerText = -1 THEN NULL ELSE AnswerText
        END AS discus_with_supervisor
    FROM Answer
    WHERE QuestionID = 19;
    """,
    db_engine,
)
q_discus_with_supervisor.discus_with_supervisor = to_category(
    q_discus_with_supervisor.discus_with_supervisor, "yes_maybe_no"
)

# -- Atmosphere for openness: long DF
q_openness = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        'in interview' discuss_mental_issue,
        CASE WHEN AnswerText = -1 THEN NULL ELSE AnswerText END AS choice
    FROM Answer
    WHERE QuestionID = 12

    UNION ALL

    SELECT
        UserID user_id,
        'with coworkers' discuss_mental_issue,
        CASE WHEN AnswerText = -1 THEN NULL ELSE AnswerText END AS choice
    FROM Answer
    WHERE QuestionID = 18

    UNION ALL

    SELECT
        UserID user_id,
        'with supervisor' discuss_mental_issue,
        CASE WHEN AnswerText = -1 THEN NULL ELSE AnswerText END AS choice
    FROM Answer
    WHERE QuestionID = 19;
    """,
    db_engine,
)
q_openness.choice = to_category(q_openness.choice, "yes_maybe_no")
Code
# -- Group 'Not applicable to me' was excluded
q_productivity = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE
            WHEN AnswerText = 'Yes' THEN "Often"
            WHEN AnswerText = 'Unsure' THEN "Unsure"
            WHEN AnswerText = 'No' THEN "Rarely"
            WHEN AnswerText IN (-1, 'Not applicable to me') THEN NULL
            ELSE AnswerText
        END AS affected_productivity
    FROM Answer
    WHERE QuestionID = 54;
    """,
    db_engine,
)
q_productivity.affected_productivity = pd.Categorical(
    q_productivity.affected_productivity,
    categories=["Often", "Unsure", "Rarely"],
)
Merge the variables of interest into a wide-format data frame (one row per user ID):
Code
ds = my.merge_all(
    [
        # -- Respondent profile
        q_gender, q_age, q_country_live,
        # -- Mental health status
        q_have_mhd, q_ever_diagnosed_mhd, q_sought_professional_treatment,
        # -- Company/Org./Role profile
        q_self_employed, q_company_size, q_is_tech_org, q_is_tech_it_role,
        # -- Mental health services at work
        q_employer_teach, q_employer_mental_healthcare,
        q_anonym_protection_to_treat,
        # -- Atmosphere for openness
        q_disclose_interview, q_discus_with_coworkers, q_discus_with_supervisor,
        # -- Productivity
        q_productivity,
    ]
)
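my.merge_all() is defined in functions.py and is not shown in this document. A plausible minimal equivalent (an assumption about its behavior, not the author's implementation) is a chain of outer merges on user_id:
Code
from functools import reduce

def merge_all_sketch(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Outer-merge a list of data frames on 'user_id' (illustrative sketch)."""
    return reduce(
        lambda left, right: left.merge(right, on="user_id", how="outer"), frames
    )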
3.2 Respondents’ Activity
In this subsection, we will look into the questionnaire sizes and the number of respondents who participated in each survey.
As the database’s description provides no information on duplicate responses, in this analysis we assume that the number of responses (i.e., filled surveys) equals the number of respondents.
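This assumption can be probed directly by checking whether any (survey, user, question) combination appears more than once; a minimal sketch:
Code
# Any duplicated (SurveyID, UserID, QuestionID) combinations?
pd.read_sql(
    """--sql
    SELECT SurveyID, UserID, QuestionID, COUNT(*) AS n
    FROM Answer
    GROUP BY SurveyID, UserID, QuestionID
    HAVING COUNT(*) > 1;
    """,
    db_engine,
)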
Code
# Number of questions and respondents' activity
df_y_q = pd.read_sql(
    """--sql
    WITH by_year AS (
        SELECT
            SurveyID year,
            COUNT(DISTINCT QuestionID) n_questions,
            COUNT(DISTINCT UserID) n_respondents
        FROM Answer
        GROUP BY year
    )
    SELECT
        year,
        n_questions,
        n_respondents,
        -- To treat the result as float, 0.0 was added:
        ROUND((n_respondents + 0.0) / n_questions, 1) respondents_per_question
    FROM by_year;
    """,
    db_engine,
)
Over time, the questionnaire size increased but respondents’ activity decreased; the decreasing trend is visible in both absolute and relative activity. The exception is 2016, when the absolute number of respondents increased. This might be related to the fact that there was no survey in 2015, or to other factors that were not investigated (e.g., survey advertising strategies).
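One way to visualize both trends from df_y_q (a minimal sketch):
Code
# Sketch: absolute and relative respondents' activity by year
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].bar(df_y_q["year"], df_y_q["n_respondents"], color=blue)
axes[0].set(title="Respondents per Survey", xlabel="Year")
axes[1].bar(df_y_q["year"], df_y_q["respondents_per_question"], color=orange)
axes[1].set(title="Respondents per Question", xlabel="Year")
plt.tight_layout()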
3.3 Respondent’s Profile
In the surveys, there were more male (72%) than female (24%) respondents. The mean age was around 34 years (standard deviation = 8). The dominant countries of residence were the USA (62%) and the UK (11%), and the dominant regions were North America (67%) and Europe (26%). Roughly half of the respondents in the 2016 survey and almost all (>90%) in the 2017-2019 surveys had been diagnosed with an MHD at some point, but at the time of the surveys only around 40% identified themselves as currently having an MHD. The respondents were either self-employed or worked in companies of various sizes. The companies were mainly (78%) from the tech industry, and the respondents mainly (94%) worked as tech/IT specialists.
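These percentages can be reproduced from the merged data frame ds; a minimal sketch:
Code
# Sketch: percentage breakdowns behind the profile summary
for col in ["gender", "country_major_ctb", "region", "is_tech_org", "is_tech_it_role"]:
    print(f"\n{col}:")
    print(ds[col].value_counts(normalize=True).mul(100).round(1))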
3.3.1 Mental Health and Productivity
Approximately 81% of respondents (out of 453) think that a mental health issue often affects their productivity. Furthermore, based on the answers of 2031 respondents, productivity is interfered with less when the mental health issues are treated effectively.
This information was revealed by analyzing the answers to the following 3 questions:
Code
print_questions(54)
Question:
   ID                                                                      Question
0  54  Do you believe your productivity is ever affected by a mental health issue?
Code
f_tbl = my.calc_counts_and_percentages("affected_productivity", ds)
my.plot_counts_with_labels(
    f_tbl,
    x_lab="",
    rot=0,
    title="How Often Productivity Is Affected\nby Mental Illness",
    color=[orange, blue, blue],
)
f_tbl
  affected_productivity    n percent
0                 Often  367   81.0%
1                Unsure   60   13.2%
2                Rarely   26    5.7%
Code
print_questions((48, 49))
Questions:
   ID                                                                                                                                                           Question
0  48                              If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?
1  49  If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?
Code
# -- Group 'Not applicable to me' was excluded
q_48_49 = pd.read_sql(
    """--sql
    WITH recoded_48_49 AS (
        SELECT
            UserID user_id,
            SurveyID year,
            CASE
                WHEN AnswerText = 'Often' THEN "D. Often"
                WHEN AnswerText = 'Sometimes' THEN "C. Sometimes"
                WHEN AnswerText = 'Rarely' THEN "B. Rarely"
                WHEN AnswerText = 'Never' THEN "A. Never"
                WHEN AnswerText = 'Not applicable to me' THEN NULL
                ELSE AnswerText
            END AS productivity_interference,
            QuestionID
        FROM Answer
        WHERE QuestionID IN (48, 49)
    )
    SELECT user_id, year, 'Treated' AS treatment, productivity_interference
    FROM recoded_48_49
    WHERE QuestionID = 48

    UNION ALL

    SELECT user_id, year, 'Untreated' AS treatment, productivity_interference
    FROM recoded_48_49
    WHERE QuestionID = 49;
    """,
    db_engine,
)

# N respondents
# Non-relevant values were removed; the same people answered both questions
# (on the situations with and without effective treatment).
n = q_48_49.dropna()["user_id"].nunique()

# Absolute counts
ct_abs = my.crosstab(q_48_49, "treatment", "productivity_interference")

# Plot
plt.rc("figure", figsize=(7, 3))
my.plot_crosstab_as_barplot(
    ct_abs,
    title="Mental Illness and Interference with Productivity",
    ylim=[None, 70],
    color=[green, blue, orange, red],
)
plt.legend(frameon=False, loc="upper left")

# Output
print(
    f"After removing non-relevant answers, n={n} respondents were included.\n"
    "Cross-tabulation:"
)
ct_abs["Total"] = ct_abs.sum(numeric_only=True, axis=1)
display(ct_abs)
print("Percentage of group (treated/untreated) totals:")
After removing non-relevant answers, n=2031 respondents were included.
Cross-tabulation:

productivity_interference  A. Never  B. Rarely  C. Sometimes  D. Often  Total
treatment
Treated                         165        700           808       166   1839
Untreated                        24        113           672      1183   1992

Percentage of group (treated/untreated) totals:
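The percentage table itself is not reproduced above; it can be derived from ct_abs, for example:
Code
# Percentages of group (treated/untreated) totals, derived from absolute counts
ct_abs.drop(columns="Total").div(ct_abs["Total"], axis=0).mul(100).round(1)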
3.3.2 Factors Related to MHD Treatment Seeking
First, let’s look at respondents’ willingness and openness to discuss MHD issues during a job interview, with coworkers, and with a supervisor.
Questions:
   ID                                                                                       Question
0  12          Would you bring up a mental health issue with a potential employer in an interview?
1  18             Would you feel comfortable discussing a mental health issue with your coworkers?
2  19  Would you feel comfortable discussing a mental health issue with your direct supervisor(s)?
choice                 Yes  Maybe    No
discuss_mental_issue
in interview           231   1036  2951
with coworkers         682   1053   719
with supervisor        932    815   707
Most people are not willing to talk about mental health during a job interview. Comparing the remaining two situations, respondents are slightly more open to discussing mental health topics with a supervisor than with coworkers.
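The same comparison in relative terms can be computed from the long-format q_openness data frame; a minimal sketch:
Code
# Row percentages: share of each answer within every discussion context
pd.crosstab(
    q_openness["discuss_mental_issue"], q_openness["choice"], normalize="index"
).mul(100).round(1)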
Next, let’s explore the factors that might be related to the willingness to seek professional MHD help.
The main findings of the following analysis will be described in the “Take Away Points” section.
Is your primary role within your company related to tech/IT?
ever_sought_prof_mhd_treatment  Yes   No
is_tech_it_role
Yes                             891  580
No                               66   34
Code
print_questions(16)

c_tab = my.crosstab(ds, "employer_teach", "ever_sought_prof_mhd_treatment")
my.plot_crosstab_as_barplot(
    c_tab,
    xlabel="Received MHD related training at work",
    title="Ever Sought Professional MHD Treatment",
    ylim=[None, 80],
)
plt.gca().legend(frameon=False, title="Ever sought MHD help", ncol=2)
c_tab
Question:
   ID                                                                                                       Question
0  16  Does your employer offer resources to learn more about mental health disorders and options for seeking help?
ever_sought_prof_mhd_treatment  Yes   No
employer_teach
Yes                             475  227
No                              627  457
Don't know                      361  307
Code
print_questions(10)

c_tab = my.crosstab(
    ds, "employer_mental_healthcare", "ever_sought_prof_mhd_treatment"
)
my.plot_crosstab_as_barplot(
    c_tab,
    ylim=[None, 80],
    xlabel="Mental health benefits as health care coverage at work",
    title="Ever Sought Professional MHD Treatment",
)
plt.gca().legend(frameon=False, title="Ever sought MHD help", ncol=2)
c_tab
Question:
   ID                                                                             Question
0  10  Does your employer provide mental health benefits as part of healthcare coverage?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
ever_sought_prof_mhd_treatment   Yes    No
anonym_protection_to_treat
Yes                              710   425
No                               116    97
Don't know                      1274  1092
4 Take Away Points
In this project, the “Mental Health in the Tech Industry” dataset, which covers the 2014-2019 surveys, was explored. The findings are most applicable to North America and Europe, especially the USA, as most of the respondents were from there.
The results show that effective MHD treatment may increase productivity at work.
In order to receive professional treatment, people first have to seek professional help.
The factors positively related to seeking MHD help:
mental health benefits as part of healthcare coverage,
protected anonymity when using MHD treatment resources,
working in a bigger company,
older age, and
employer-provided education on MHD.
The factors that were irrelevant (no relationship found):
working in a technical/non-technical role or company.
More effort should be placed on improving the situation of the following groups as they are less prone to look for MHD-related help:
males,
younger people, and
Europeans (compared to North Americans).
Employees are more willing to reveal MHD issues to managers than to coworkers, so managers have more influence on ways to increase productivity.
It seems that the situation has been improving over the years, but there is still much to do to reach an optimal state.
Suggestions to Improve the Analysis
A broader analysis that includes more questions and more relationships between the answers could be performed, e.g.:
Factors such as race were not included but might be important.
Not only general MHD but also subtypes of it should be investigated.
In some analyses, more emphasis could be put on cases where the sample size or response rate is small, to indicate that those results are less reliable.