The Analysis of Mental Health in the Tech Industry

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2022-11-17

Updated

2023-07-28

Mental Health in the Tech Industry project logo. Generated with Leonardo.Ai.

Tools: SQL, Python
Helper tools: VSCode, Git

Abbreviations

  • EDA – exploratory data analysis
  • IT – information technologies
  • MH – mental health
  • MHD – mental health disorder
  • n – sample/group size
  • tech – technology
  • UK – United Kingdom
  • USA, US – United States of America

1 Introduction

Various mental disorders are a widespread phenomenon: approximately 1 in 5 adults (21%, 52.9 million people) in the US experienced mental illness in 2020^1, and up to 1 billion people across the globe suffer from various mental disorders. Lost productivity due to anxiety and depression (the two most common conditions) alone costs 1 trillion US dollars annually worldwide^2. Still, not enough attention is paid to mental health, and not enough healthcare resources are assigned to improve the situation. As one of the solutions in the US, Open Sourcing Mental Health, a non-profit corporation, was founded in 2013. Its purpose is to raise awareness, educate, and provide resources to support mental wellness in the tech and open source communities^3. Since 2014, the corporation has organized surveys to investigate and understand the status of mental health, attitudes towards mental health, and the frequency of mental health disorders in the tech industry. This project is dedicated to the analysis of these surveys’ data, acquired in 2014-2019.

2 Overview and Inspection

In this project, the “Mental Health in the Tech Industry” database from Kaggle, which covers surveys conducted in 2014-2019, was used. The database consists of 3 tables: ‘Answer’, ‘Question’, and ‘Survey’:

  • Table ‘Survey’ contains the names of the surveys, which were carried out in different years. The survey ID column indicates the year of investigation. The data spans 2014 to 2019 (excluding the year 2015).

  • Table ‘Question’ contains the 105 unique questions used across all the surveys. Question IDs range from 1 to 118; IDs 35 through 47 are missing. Twelve questions are common to all surveys, some are used in just a single survey, but most are shared by a subset of the surveys.

  • Table ‘Answer’ contains the answers to the questions of each survey in a long-format table (one row per combination of year, question, and respondent). Survey ID ranges from 2014 to 2019 with 5 unique values, user ID from 1 to 4218 with 4218 unique responses (filled survey questionnaires)^4, and question ID from 1 to 118 with 105 unique values, so these numbers match data from the other tables.

    In the answer text field of this table, there are some strange values, e.g., “¯\(°_o)/¯”, that might need further investigation if those questions are included in the analysis.

    There are no NaN/Null values:

    • Due to the table’s long format, structural missing values are not present; they would appear if there were one record per user or per user-survey combination.
    • Unanswered/skipped questions are either not included, or their missing values are coded in a different way (e.g., as -1; we will see this in the pre-processing section).

Find the details in the subsections below.

2.1 Setup

Code
# Packages and modules -------------------------------
# Typing
from typing import Union

# To connect SQL database
import sqlalchemy as sqa

# Data wrangling
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Missing value identification
import missingno as msno

# Custom functions
import functions as my

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 100)


# colors
green, blue, orange, red = "tab:green", "tab:blue", "tab:orange", "tab:red"

Functions that are generalizable and may be used in several projects are in a separate file, functions.py. Ad hoc functions specific to this dataset are defined in this document:

Code
# -- Custom ad-hoc functions
# Convert to categorical
def to_category(x: pd.Series, option: str) -> pd.Categorical:
    """Convert a variable to a categorical one based on the provided option.

    Args:
        x (pd.Series): Original variable.
        option (str): Option that determines the categories for conversion.
        Possible options are:
            - 'yes_no': Convert to categories ['Yes', 'No']
            - 'yes_maybe_no': Convert to categories ['Yes', 'Maybe', 'No']
            - 'yes_no_dk': Convert to categories ['Yes', 'No', 'Don't know']
            - 'yes_no_ne_dk': Convert to categories
                       ['Yes', 'No', 'Not eligible for coverage', 'Don't know']

    Returns:
        pd.Categorical: Variable converted to the categorical data type
        with a fixed order of categories.

    Raises:
        ValueError: If an invalid option is provided.

    """
    option_to_categories = {
        "yes_no": ["Yes", "No"],
        "yes_maybe_no": ["Yes", "Maybe", "No"],
        "yes_no_dk": ["Yes", "No", "Don't know"],
        "yes_no_ne_dk": [
            "Yes", "No", "Not eligible for coverage", "Don't know"
        ],
    }
    if option not in option_to_categories:
        raise ValueError(f"Invalid option: {option}")

    return pd.Categorical(x, categories=option_to_categories[option])


# Get question text from 'Question' table by ID
def print_questions(id: Union[int, float, tuple[int, ...]]) -> None:
    """Show question text by ID.

    Args:
        id (integer or tuple of integers): Question IDs in 'Question' table.
    """
    # Normalize the input to a tuple of integers so that both a single ID
    # and a one-element tuple produce valid SQL, e.g., "IN (5)"
    if isinstance(id, (int, float)):
        id = (int(id),)
    s = "" if len(id) == 1 else "s"
    id_list = ", ".join(str(int(i)) for i in id)

    questions = pd.read_sql(
        f"""--sql
        SELECT questionid ID, questiontext 'Question'
        FROM Question 
        WHERE QuestionID IN ({id_list});
        """,
        db_engine,
    )
    print(f"Question{s}:")
    display(questions)
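
A quick usage sketch for to_category: values outside the chosen category set become missing, and the category order stays fixed, which keeps tables and plots consistent. (print_questions is exercised throughout the analysis once the database connection db_engine is created in the next subsection.)

Code
# Usage sketch: an unknown value ('N/A') becomes NaN,
# and the category order is fixed
answers = pd.Series(["Yes", "No", "Maybe", "N/A"])
to_category(answers, "yes_maybe_no")
['Yes', 'No', 'Maybe', NaN]
Categories (3, object): ['Yes', 'Maybe', 'No']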

2.2 Inspection of Tables in Database

Code
# Create a connection to the database.
db_engine = sqa.create_engine("sqlite:///data/mental_health.sqlite")
Code
# List tables
sqa.inspect(db_engine).get_table_names()
['Answer', 'Question', 'Survey']

Let’s look at each table separately.

  • Table ‘Survey’ is small and self-explanatory:
Code
pd.read_sql("SELECT COUNT(*) n_rows FROM Survey;", db_engine)
n_rows
0 5
Code
ds_survey = pd.read_sql("SELECT * FROM Survey;", db_engine)
ds_survey
SurveyID Description
0 2014 mental health survey for 2014
1 2016 mental health survey for 2016
2 2017 mental health survey for 2017
3 2018 mental health survey for 2018
4 2019 mental health survey for 2019
  • Table ‘Question’ contains the text of all questions:
Code
ds_questions = pd.read_sql("SELECT * FROM Question;", db_engine)
ds_questions.head()
questiontext questionid
0 What is your age? 1
1 What is your gender? 2
2 What country do you live in? 3
3 If you live in the United States, which state or territory do you live in? 4
4 Are you self-employed? 5
Code
pd.read_sql(
    """--sql
    SELECT 
        COUNT(*) n_rows,
        COUNT(DISTINCT questiontext) n_questions,
        MIN(questionid) question_id_min,
        MAX(questionid) question_id_max
    FROM Question;
""",
    db_engine,
)
n_rows n_questions question_id_min question_id_max
0 105 105 1 118
Code
ds_questions.isnull().sum()
questiontext    0
questionid      0
dtype: int64
Code
# Missing IDs
set(range(1, ds_questions.questionid.max() + 1)) - set(ds_questions.questionid)
{35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47}
  • Table ‘Answer’ is the biggest one:
Code
pd.read_sql(
    """--sql
    SELECT COUNT(*) n_rows__in_answers_tbl FROM Answer;
    """,
    db_engine,
)
n_rows__in_answers_tbl
0 236898
Code
ds_answer = pd.read_sql("SELECT * FROM Answer;", db_engine)
ds_answer.head()
AnswerText SurveyID UserID QuestionID
0 37 2014 1 1
1 44 2014 2 1
2 32 2014 3 1
3 31 2014 4 1
4 31 2014 5 1
Code
ds_answer.isnull().sum()
AnswerText    0
SurveyID      0
UserID        0
QuestionID    0
dtype: int64
Code
ds_answer.agg(["min", "max", "nunique"])
AnswerText SurveyID UserID QuestionID
min \n 2014 1 1
max ¯\(°_o)/¯ 2019 4218 118
nunique 4215 5 4218 105

2.3 Inspection of Questionnaires

In this subsection, we will try to answer the following questions:

  • Is the number of questions the same in each survey?
  • Are there any questions common to all surveys? Which ones?
Code
incl_questions = pd.read_sql(
    """--sql
    SELECT DISTINCT SurveyID year, QuestionID, 'True' response 
    FROM Answer;
    """,
    db_engine,
)

incl_questions_by_year = incl_questions.pivot_table(
    index="QuestionID",
    columns="year",
    values="response",
    aggfunc="first",
)

# Preview reshaped dataset
print("Bottom rows of the reshaped dataset:\n")
print(incl_questions_by_year.tail())

# Plot missing values
ax = msno.matrix(incl_questions_by_year, sparkline=False, figsize=(8, 4))
ax.set_xlabel("Year")
ax.set_ylabel("Index of Row with Question")
ax.set_title("Questions Included in the Surveys \nby Year", fontsize=20);
Bottom rows of the reshaped dataset:

year       2014  2016 2017 2018 2019
QuestionID                          
114         NaN  True  NaN  NaN  NaN
115         NaN  True  NaN  NaN  NaN
116         NaN  True  NaN  NaN  NaN
117         NaN  True  NaN  NaN  NaN
118         NaN  True  NaN  NaN  NaN

Each row represents a unique question: a dark square in the plot means the question is included in that year’s survey, while blank areas represent questions that are not present in a particular survey.

It seems that the 2017, 2018, and 2019 surveys use the same questionnaire, and around half of the questions in these surveys overlap with the questions of the 2016 survey.

There are 12 questions common to all surveys, and 31 more questions that are common to the 2016-2019 surveys.

Code
# -- IDs of questions common for all the surveys
def get_all_true_rows(data):
    """Return index labels of rows where all values are non-null."""
    return data.index[data.notnull().all(axis=1)].tolist()


common_question_id = get_all_true_rows(incl_questions_by_year)
print(f"N common questions for all the surveys: {len(common_question_id)}")
print("IDs of these questions:")
common_question_id
N common questions for all the surveys: 12
IDs of these questions:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

Additionally, the questions that are common only to the 2016-2019 surveys (excluding the questions listed above):

Code
# -- IDs of questions common just for 2016-2019 surveys

common_question_id_2 = get_all_true_rows(
    incl_questions_by_year.drop(2014, axis=1)
)
q_id = list(set(common_question_id_2) - set(common_question_id))

print(f"The number of common questions just to the 2016-2019 surveys: {len(q_id)}")
print("IDs of these questions:")
print(q_id[:20])
print(q_id[20:])
The number of common questions just to the 2016-2019 surveys: 31
IDs of these questions:
[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
[33, 34, 48, 49, 50, 51, 52, 53, 54, 55, 56]

If needed, you can find these questions listed in Supplement B (file supplement-b--common-questions.ipynb).


3 Pre-Processing and Exploratory Data Analysis

Due to the large number of questions, it was decided to concentrate mainly on the subset of questions that are common either to all surveys or to all surveys except the first one (year 2014). More concretely, these topics will be the focus of this analysis:

  1. Respondents’ activity (participation) in each survey.
  2. The profile of a typical survey respondent.
  3. Mental health and productivity.
  4. Factors related to MHD treatment seeking.

Before delving into the analysis, the data must be prepared, so let’s begin with pre-processing.

NOTE: In my opinion, the sections “Mental Health and Productivity” and “Factors Related to MHD Treatment Seeking” are the most important, so spend enough time exploring them.

3.1 Pre-Processing

Most of the data pre-processing steps are presented in this section, but some appear in other data analysis subsections where that seemed more convenient.

Frequent pre-processing steps:

  • -1 represents a missing value, so it is recoded into one (NULL).
  • in “Yes/No” questions, 0 is recoded into “No” and 1 into “Yes”.

Age. In the dataset, there are negative as well as extremely large (>300) age values. It was decided to include only values between 18 and 80 years:

Code
# Age: all_values
v_age = pd.read_sql(
    """--sql
    SELECT UserID user_id, CAST(AnswerText AS INT) AS age
    FROM Answer 
    WHERE QuestionID = 1;
    """,
    db_engine,
)

v_age.drop("user_id", axis=1).describe().T
count mean std min 25% 50% 75% max
age 4218.0 33.915363 10.478054 -29.0 28.0 33.0 38.0 329.0
Code
# -- Extreme values of age
print("Extreme values:")
pd.read_sql(
    """--sql
    WITH q1_int AS (
        SELECT CAST(AnswerText AS INT) age_value
        FROM Answer
        WHERE QuestionID = 1
    )
    
    SELECT age_value, COUNT(age_value) count
    FROM q1_int 
    WHERE age_value NOT BETWEEN 20 AND 65
    GROUP BY age_value
    ORDER BY age_value;
    """,
    db_engine,
).T
Extreme values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
age_value -29 -1 0 3 5 8 11 15 17 18 19 66 67 70 72 74 99 323 329
count 1 5 1 1 1 1 1 1 1 9 20 2 2 1 1 1 1 1 1
Code
plt.figure(figsize=(8, 3))
sns.boxplot(x="age", data=v_age).set(title="Age Distribution (Raw Data)");

Code
# -- Filter out age of interest (18-80 y.o.)
q_age = pd.read_sql(
    """--sql
    WITH q1_int AS (
        SELECT UserID user_id, CAST(AnswerText AS INT) age
        FROM Answer 
        WHERE QuestionID = 1
    )
    SELECT
        user_id, 
        -- Age
        CASE 
            WHEN age BETWEEN 18 and 80 THEN age
            ELSE NULL
        END AS age,
        -- Age group: every ~5 year
        CASE
            WHEN age >= 18 AND age < 26 THEN "18-25"
            WHEN age >= 26 AND age < 31 THEN "26-30"
            WHEN age >= 31 AND age < 36 THEN "31-35"
            WHEN age >= 36 AND age < 41 THEN "36-40"
            WHEN age >= 41 AND age < 46 THEN "41-45"
            WHEN age >= 46 AND age < 51 THEN "46-50"
            WHEN age >= 51 AND age < 56 THEN "51-55"
            WHEN age >= 56 AND age < 61 THEN "56-60"
            WHEN age >= 61 AND age < 66 THEN "61-65"
            WHEN age >= 66 AND age < 71 THEN "66-70"
            WHEN age >= 71 AND age < 76 THEN "71-75"
            ELSE NULL
        END AS age_group,
        -- Age group: 3 categories,
        CASE
            WHEN age >= 18 AND age < 36 THEN "18-35"
            WHEN age >= 36 AND age < 76 THEN "36-75"
            ELSE NULL
        END AS age_group_2
    FROM q1_int
    """,
    db_engine,
)
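
As a side note, the same ~5-year bucketing could be done on the pandas side; a minimal sketch with pd.cut (shown only for comparison with the SQL CASE expression, not used in the pipeline):

Code
# Sketch: equivalent age bucketing in pandas (for comparison only).
# right=False makes bins left-closed, so [18, 26) -> '18-25', etc.;
# out-of-range ages become NaN, matching the SQL 'ELSE NULL'.
bins = [18, 26, 31, 36, 41, 46, 51, 56, 61, 66, 71, 76]
labels = ["18-25", "26-30", "31-35", "36-40", "41-45", "46-50",
          "51-55", "56-60", "61-65", "66-70", "71-75"]
age_group_pd = pd.cut(v_age["age"], bins=bins, labels=labels, right=False)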

Gender. In the dataset, there were many different values describing gender. For pre-processing, all values were lowercased; the words “female” and “femmina” were treated as “female”, and the words “male”, “cishet male”, “masculine”, and “masculino” as “male”. All remaining values were assigned to the category “Other/Unknown”.

Code
# Gender
q_gender = pd.read_sql(
    """--sql
    WITH gender_lower AS (
        SELECT UserID user_id, LOWER(AnswerText) gender_answer, SurveyID year
        FROM Answer
        WHERE QuestionID = 2
    )
        
    SELECT 
        user_id,
        year,
        -- Pre-process gender values. 
        -- If needed, use original values from "gender_answer" for checking
        (CASE 
            -- female options
            WHEN gender_answer LIKE "female"  OR
                 gender_answer LIKE "femmina"    
            THEN "Female"
            -- male options
            WHEN gender_answer LIKE "male"        OR
                 gender_answer LIKE "masculine"   OR
                 gender_answer LIKE "masculino"   OR
                 gender_answer LIKE "cishet male" 
            THEN "Male"
            -- other
            ELSE "Other/Unknown"
        END) AS gender
    FROM gender_lower;
""",
    db_engine,
)

Country of residence and region. Country names were recoded in the following way:

  • ‘United States’ and ‘United States of America’ were merged into ‘USA’
  • ‘United Kingdom’ shortened to ‘UK’
  • Values -1 were treated as missing values.

Additional variables were created too:

  • country_major_ctb (countries that are major contributors to a survey): in this variable, least frequently occurring countries were merged into “Other”
  • region: a variable to group countries into regions like North America, Asia or Europe.
Code
# -- Country of residence
q_country_live = pd.read_sql(
    """--sql
    -- Use unified or shorter country names
    WITH countries_recoded AS (
        SELECT 
            UserID user_id,
            CASE 
                WHEN AnswerText IN ('United States', 'United States of America') 
                THEN 'USA'
                WHEN AnswerText IN ('United Kingdom')
                THEN 'UK'
                WHEN AnswerText IN (-1 , 'Unknown') OR AnswerText IS NULL
                THEN NULL
                ELSE AnswerText
            END AS country
        FROM answer
        WHERE QuestionID = 3
    ),
    -- Filter out countries which contributed least
    major_contributor AS (
        SELECT country, COUNT(country) count
        FROM countries_recoded
        GROUP BY country
        HAVING COUNT(country) >= 50
        ORDER BY count DESC
    )
    
    SELECT
        user_id,
        country,
        -- Only most contributing countries
        CASE 
            WHEN country IN (SELECT country from major_contributor) THEN country
            WHEN country IN (-1, 'Unknown') OR country IS NULL THEN NULL
            ELSE 'Other'
        END AS country_major_ctb,
        -- Region
        CASE
            WHEN country IN (
                 "Algeria", "Ethiopia", "Ghana", "Nigeria", "Kenya",
                 "South Africa", "Mauritius", "Zimbabwe") 
            THEN 'Africa'

            WHEN country IN ("Australia", "New Zealand") 
            THEN 'Australasia'
 
            WHEN country IN (
                "China", "Georgia", "India", "Israel", "Japan", 
                "Philippines", "Singapore", "Thailand", "Pakistan",
                "Turkey", "Iran", "Afghanistan", "Indonesia", 
                "Bangladesh", "Hong Kong", "Saudi Arabia", "Jordan",
                "Brunei", "Taiwan", "Vietnam") 
            THEN 'Asia'
            
            WHEN country IN (
                "Austria", "Belgium", "Bosnia and Herzegovina", "Bulgaria", 
                "Croatia",  "Czech Republic", "Denmark", "Finland", "France",
                "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia",
                "Lithuania", "Moldova", "Serbia", "Ukraine", "Belarus", 
                "Netherlands", "Slovakia", "Estonia", "Iceland", "Macedonia",
                "Norway", "Poland", "Portugal", "Romania", "Slovenia", 
                "Spain", "Sweden", "Switzerland", "UK", --Europe
                "Russia")  -- Both Asia and Europe
            THEN 'Europe'

            WHEN country IN ("USA", "Canada", "Bahamas, The") 
            THEN 'North America'
            
            WHEN country IN (
                "Mexico", "Brazil", "Costa Rica", "Colombia", "Uruguay", 
                "Guatemala", "Ecuador", "Argentina", "Chile", "Venezuela")
            THEN 'Latin America'
            
            WHEN country IN  ('Other', 'Unknown')
            THEN NULL
            
            ELSE country
        END AS region
        
    FROM countries_recoded;
    """,
    db_engine,
)

Other variables of interest were also prepared for analysis. In some cases, similar values were merged into a single category. But usually, values were recoded into a human-readable form (like “Yes”, “No”, etc.) and converted to categorical variables to keep the order of categories constant. Some values were prepended with letters (A, B, etc.) so that alphabetical sorting matches the intended category order.

Code
# -- Mental health
q_have_mhd = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE 
            -- "Maybe", "Possibly", "Don't Know" looks like the same state
            -- of uncertainty about diagnosis
            WHEN AnswerText IN ("Maybe", "Possibly", "Don't Know")
            THEN "Don't know"
            WHEN AnswerText = -1 
            THEN NULL
            ELSE AnswerText
        END AS have_mhd
    FROM Answer 
    WHERE QuestionID = 33;
    """,
    db_engine,
)
q_have_mhd.have_mhd = to_category(q_have_mhd.have_mhd, "yes_no_dk")

q_ever_diagnosed_mhd = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE 
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS ever_diagnosed_mhd
    FROM Answer 
    WHERE QuestionID = 34;
    """,
    db_engine,
)
q_ever_diagnosed_mhd.ever_diagnosed_mhd = to_category(
    q_ever_diagnosed_mhd.ever_diagnosed_mhd, "yes_no"
)

q_sought_professional_treatment = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE 
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS ever_sought_prof_mhd_treatment
    FROM Answer 
    WHERE QuestionID = 7;
    """,
    db_engine,
)
q_sought_professional_treatment.ever_sought_prof_mhd_treatment = to_category(
    q_sought_professional_treatment.ever_sought_prof_mhd_treatment, "yes_no"
)
Code
# -- Company related
q_self_employed = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE 
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS self_employed
    FROM Answer 
    WHERE QuestionID = 5;
    """,
    db_engine,
)
q_self_employed.self_employed = to_category(
    q_self_employed.self_employed, "yes_no"
)

q_company_size = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        -- Company size: 6 categories
        CASE
            WHEN AnswerText = '1-5' THEN 'A. 1-5'
            WHEN AnswerText = '6-25' THEN 'B. 6-25'
            WHEN AnswerText = '26-100' THEN 'C. 26-100'
            WHEN AnswerText = '100-500' THEN 'D. 100-500'
            WHEN AnswerText = '500-1000' THEN 'E. 500-1000'
            WHEN AnswerText = 'More than 1000' THEN 'F. More than 1000'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS company_size,
        -- Company size: 3 categories
        CASE
            WHEN AnswerText IN ('1-5', '6-25', '26-100') THEN 'A. 1-100'
            WHEN AnswerText IN ('100-500', '500-1000') THEN 'B. 100-1000'
            WHEN AnswerText = 'More than 1000' THEN 'C. More than 1000'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS company_size_3
    FROM Answer 
    WHERE QuestionID = 8;
    """,
    db_engine,
)

q_is_tech_org = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE 
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS is_tech_org
    FROM Answer 
    WHERE QuestionID = 9;
    """,
    db_engine,
)
q_is_tech_org.is_tech_org = to_category(q_is_tech_org.is_tech_org, "yes_no")

q_is_tech_it_role = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE 
            WHEN AnswerText = 0 THEN 'No'
            WHEN AnswerText = 1 THEN 'Yes'
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS is_tech_it_role
    FROM Answer 
    WHERE QuestionID = 13;
    """,
    db_engine,
)
q_is_tech_it_role.is_tech_it_role = to_category(
    q_is_tech_it_role.is_tech_it_role, "yes_no"
)
Code
# -- Company mental healthcare resources
q_employer_teach = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE 
            WHEN AnswerText IN ("I don't know", "Don't know") THEN "Don't know"
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS employer_teach
    FROM Answer 
    WHERE QuestionID = 16;
    """,
    db_engine,
)
q_employer_teach.employer_teach = to_category(
    q_employer_teach.employer_teach, "yes_no_dk"
)

q_employer_mental_healthcare = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE 
            WHEN AnswerText IN ("I don't know", "Don't know")
            THEN "Don't know"
            WHEN AnswerText IN ("Not eligible for coverage / NA")
            THEN "Not eligible for coverage"
            WHEN AnswerText = -1
            THEN NULL
            ELSE AnswerText
        END AS employer_mental_healthcare
    FROM Answer 
    WHERE QuestionID = 10;
    """,
    db_engine,
)
q_employer_mental_healthcare.employer_mental_healthcare = to_category(
    q_employer_mental_healthcare.employer_mental_healthcare, "yes_no_ne_dk"
)

q_anonym_protection_to_treat = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE 
            WHEN AnswerText IN ("I don't know", "Don't know") THEN "Don't know"
            WHEN AnswerText = -1 THEN NULL
            ELSE AnswerText
        END AS anonym_protection_to_treat
    FROM Answer 
    WHERE QuestionID = 11;
    """,
    db_engine,
)
q_anonym_protection_to_treat.anonym_protection_to_treat = to_category(
    q_anonym_protection_to_treat.anonym_protection_to_treat, "yes_no_dk"
)
Code
# -- Atmosphere for openness
q_disclose_interview = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE WHEN AnswerText = -1 THEN NULL 
             ELSE AnswerText
         END AS disclose_in_interview
    FROM Answer 
    WHERE QuestionID = 12;
    """,
    db_engine,
)
q_disclose_interview.disclose_in_interview = to_category(
    q_disclose_interview.disclose_in_interview, "yes_maybe_no"
)


q_discus_with_coworkers = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE WHEN AnswerText = -1 THEN NULL 
             ELSE AnswerText
        END AS discus_with_coworkers
    FROM Answer 
    WHERE QuestionID = 18;
    """,
    db_engine,
)
q_discus_with_coworkers.discus_with_coworkers = to_category(
    q_discus_with_coworkers.discus_with_coworkers, "yes_maybe_no"
)

q_discus_with_supervisor = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        CASE WHEN AnswerText = -1 THEN NULL
             ELSE AnswerText
        END AS discus_with_supervisor
    FROM Answer 
    WHERE QuestionID = 19;
    """,
    db_engine,
)
q_discus_with_supervisor.discus_with_supervisor = to_category(
    q_discus_with_supervisor.discus_with_supervisor, "yes_maybe_no"
)

# -- Atmosphere for openness: long DF
q_openness = pd.read_sql(
    """--sql
    SELECT
        UserID user_id, 
        'in interview' discuss_mental_issue,
        CASE WHEN AnswerText = -1 THEN NULL 
             ELSE AnswerText
         END AS choice
    FROM Answer 
    WHERE QuestionID = 12
    
    UNION ALL

    SELECT
        UserID user_id, 
        'with coworkers' discuss_mental_issue,
        CASE WHEN AnswerText = -1 THEN NULL 
             ELSE AnswerText
        END AS choice
    FROM Answer 
    WHERE QuestionID = 18
    
    UNION ALL
  
    SELECT
        UserID user_id,
        'with supervisor' discuss_mental_issue,
        CASE WHEN AnswerText = -1 THEN NULL
             ELSE AnswerText
        END AS choice
    FROM Answer 
    WHERE QuestionID = 19;
    """,
    db_engine,
)

q_openness.choice = to_category(q_openness.choice, "yes_maybe_no")
Code
# -- Group 'Not applicable to me' was excluded
q_productivity = pd.read_sql(
    """--sql
    SELECT
        UserID user_id,
        CASE 
            WHEN AnswerText = 'Yes' THEN "Often"
            WHEN AnswerText = 'Unsure' THEN "Unsure"
            WHEN AnswerText = 'No' THEN "Rarely"
            WHEN AnswerText IN (-1, 'Not applicable to me') THEN NULL
            ELSE AnswerText
        END AS affected_productivity
    FROM Answer 
    WHERE QuestionID = 54;
    """,
    db_engine,
)

q_productivity.affected_productivity = pd.Categorical(
    q_productivity.affected_productivity,
    categories=["Often", "Unsure", "Rarely"],
)

Merge the variables of interest into a wide-format data frame (one row per user ID):

Code
ds = my.merge_all(
    [
        # -- Respondent profile
        q_gender,
        q_age,
        q_country_live,
        # -- Mental health status
        q_have_mhd,
        q_ever_diagnosed_mhd,
        q_sought_professional_treatment,
        # -- Company/Org./Role profile
        q_self_employed,
        q_company_size,
        q_is_tech_org,
        q_is_tech_it_role,
        # -- Mental health services at work
        q_employer_teach,
        q_employer_mental_healthcare,
        q_anonym_protection_to_treat,
        # -- Atmosphere for openness
        q_disclose_interview,
        q_discus_with_coworkers,
        q_discus_with_supervisor,
        # -- Productivity
        q_productivity,
    ]
)
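
The helper my.merge_all is defined in functions.py and not shown in this document. A minimal sketch of an equivalent, assuming it simply outer-merges all data frames on their shared key columns (user_id, plus year where present):

Code
# Sketch of a my.merge_all equivalent (an assumption about its behavior;
# the actual helper lives in functions.py)
from functools import reduce

def merge_all(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Outer-merge data frames on their common columns (here: user_id)."""
    return reduce(lambda left, right: left.merge(right, how="outer"), frames)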
Code
ds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4218 entries, 0 to 4217
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   user_id                         4218 non-null   int64   
 1   year                            4218 non-null   int64   
 2   gender                          4218 non-null   object  
 3   age                             4202 non-null   float64 
 4   age_group                       4202 non-null   object  
 5   age_group_2                     4202 non-null   object  
 6   country                         4216 non-null   object  
 7   country_major_ctb               4216 non-null   object  
 8   region                          4214 non-null   object  
 9   have_mhd                        2958 non-null   category
 10  ever_diagnosed_mhd              2095 non-null   category
 11  ever_sought_prof_mhd_treatment  4218 non-null   category
 12  self_employed                   4200 non-null   category
 13  company_size                    3714 non-null   object  
 14  company_size_3                  3714 non-null   object  
 15  is_tech_org                     3714 non-null   category
 16  is_tech_it_role                 1571 non-null   category
 17  employer_teach                  2454 non-null   category
 18  employer_mental_healthcare      3714 non-null   category
 19  anonym_protection_to_treat      3714 non-null   category
 20  disclose_in_interview           4218 non-null   category
 21  discus_with_coworkers           2454 non-null   category
 22  discus_with_supervisor          2454 non-null   category
 23  affected_productivity           453 non-null    category
dtypes: category(13), float64(1), int64(2), object(8)
memory usage: 417.9+ KB

3.2 Activity of Respondents

In this subsection, we will look into the questionnaire sizes and the number of respondents who participated in each survey.

As there is no information in the database’s description about duplicate responses, in this analysis we assume that each response (i.e., a filled questionnaire) corresponds to a unique respondent.
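
This assumption can be partially sanity-checked, e.g., by counting repeated (survey, user, question) triples; a sketch of such a check is below. Note that multi-select questions may legitimately produce several rows per user and question, so a non-zero count would still require manual inspection.

Code
# Sanity-check sketch: count (survey, user, question) triples
# that occur more than once in the 'Answer' table
pd.read_sql(
    """--sql
    SELECT COUNT(*) n_repeated_triples
    FROM (
        SELECT SurveyID, UserID, QuestionID
        FROM Answer
        GROUP BY SurveyID, UserID, QuestionID
        HAVING COUNT(*) > 1
    );
    """,
    db_engine,
)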

Code
# N question and respondents' activity
df_y_q = pd.read_sql(
    """--sql
    WITH by_year AS (
         SELECT 
            SurveyID year, 
            COUNT(DISTINCT QuestionID) n_questions,
            COUNT(DISTINCT UserID) n_respondents
        FROM Answer 
        GROUP BY year
    )
    
    SELECT year, 
          n_questions, 
          n_respondents, 
          -- To treat result as float, 0.0 was added:
          ROUND((n_respondents + 0.0)/n_questions, 1) respondents_per_question
    FROM by_year;
    """,
    db_engine,
)
Code
# Plot 1:
plt.rc("figure", figsize=(7, 3))
fig1, ax1 = plt.subplots()

ax = df_y_q.plot(
    x="year",
    y="n_questions",
    kind="bar",
    color=green,
    ec="black",
    position=1,
    rot=0,
    width=0.3,
    legend=False,
    xlabel="Year",
    ylabel="n questions",
    ax=ax1,
)

ax1b = ax.twinx()

df_y_q.plot(
    x="year",
    y="respondents_per_question",
    kind="bar",
    color=blue,
    ec="black",
    position=0,
    width=0.3,
    legend=False,
    xlabel="Year",
    ylabel="n respondents per question\n(activity)",
    ax=ax1b,
)

ax1b.set_xlim(-0.5, 5 - 0.5)
fig1.legend(["Questions", "Activity"], loc="lower right", ncol=2, frameon=False)
plt.title("Questionnaire Size and Respondents' Activity\n over Time")

# Table:
df_y_q
year n_questions n_respondents respondents_per_question
0 2014 26 1260 48.5
1 2016 60 1433 23.9
2 2017 76 756 9.9
3 2018 76 417 5.5
4 2019 76 352 4.6

Code
# Plot 2:
plt.rc("figure", figsize=(7, 3))
fig1, ax1 = plt.subplots()

ax = df_y_q.plot(
    x="year",
    y="n_respondents",
    kind="bar",
    color=orange,
    ec="black",
    position=1,
    rot=0,
    width=0.3,
    legend=False,
    xlabel="Year",
    ylabel="n respondents (total)",
    ax=ax1,
)

ax1b = ax.twinx()

df_y_q.plot(
    x="year",
    y="respondents_per_question",
    kind="bar",
    color=blue,
    ec="black",
    position=0,
    width=0.3,
    legend=False,
    xlabel="Year",
    ylabel="n respondents per question",
    ax=ax1b,
)
ax1b.set_ylim(0, 57.3)

ax1b.set_xlim(-0.5, 5 - 0.5)
fig1.legend(["Total", "Per question"], loc="lower right", ncol=2, frameon=False)
plt.title("Absolute and Relative Respondents' Activity\n over Time");

Over time, the questionnaire size increased, but respondents’ activity decreased. The decreasing trend is visible for both absolute and relative activity. An exception is the year 2016, when the absolute number of respondents increased. This might be related to the fact that there was no poll in 2015, or to other factors that were not investigated (e.g., poll advertising strategies).


3.3 Respondent’s Profile

In the surveys, there were more male (72%) than female (24%) respondents. The mean age was around 34 years (standard deviation = 8). The dominant countries of residence were the USA (62%) and the UK (11%), and the dominant regions were North America (67%) and Europe (26%). In the 2016 survey, roughly half of the respondents, and in the 2017-2019 surveys almost all (>90%), had at some point been diagnosed with an MHD, but at the time of the surveys just around 40% identified themselves as currently having an MHD. The respondents were either self-employed or worked in companies of various sizes. The companies were mainly (78%) from the tech industry, and the respondents mainly (94%) worked in tech/IT roles.

Find the details below:

Code
f_tbl = my.calc_counts_and_percentages("gender", ds, sort=False)
my.plot_counts_with_labels(
    f_tbl, x_lab="", y_lab="Count", rot=0, title="Respondent's Gender"
)
f_tbl
gender n percent
0 Female 1025 24.3%
1 Male 3046 72.2%
2 Other/Unknown 147 3.5%
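
The helpers my.calc_counts_and_percentages and my.plot_counts_with_labels come from functions.py and are not shown here. A rough sketch of what the counting helper might look like, inferred from its inputs and outputs (an assumption, not the actual implementation):

Code
# Sketch of a my.calc_counts_and_percentages equivalent (assumed behavior)
def calc_counts_and_percentages(column, data, sort=True):
    """Frequency table (counts and percentages) of a single column.

    sort=True sorts by count (descending); sort=False keeps the
    category/appearance order; sort="index" sorts by the index.
    """
    counts = data[column].value_counts(sort=False)
    if sort is True:
        counts = counts.sort_values(ascending=False)
    elif sort == "index":
        counts = counts.sort_index()
    percent = (counts / counts.sum() * 100).round(1).astype(str) + "%"
    return pd.DataFrame(
        {column: counts.index, "n": counts.to_numpy(), "percent": percent.to_numpy()}
    )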

Code
plt.figure(figsize=[7, 3])
(
    sns.histplot(data=ds, x="age", bins=18).set(
        title="Respondent's Age", xlabel="Age, years"
    )
)
pd.DataFrame(round(ds.age.describe(), 1)).T
count mean std min 25% 50% 75% max
age 4202.0 33.9 8.1 18.0 28.0 33.0 38.0 74.0

Code
f_tbl = my.calc_counts_and_percentages("country_major_ctb", ds)
my.plot_counts_with_labels(
    f_tbl, x_lab="", y_lab="Count", title="Respondent's Country of Residence"
)
f_tbl
country_major_ctb n percent
0 USA 2604 61.8%
1 UK 482 11.4%
2 Other 472 11.2%
3 Canada 199 4.7%
4 Germany 136 3.2%
5 Netherlands 98 2.3%
6 Australia 73 1.7%
7 France 51 1.2%
8 Ireland 51 1.2%
9 India 50 1.2%

Code
f_tbl = my.calc_counts_and_percentages("region", ds)
my.plot_counts_with_labels(
    f_tbl,
    x_lab="",
    y_lab="Count",
    rot=0,
    title="Respondent's Region of Residence",
)
f_tbl
region n percent
0 North America 2804 66.5%
1 Europe 1111 26.4%
2 Asia 108 2.6%
3 Australasia 97 2.3%
4 Latin America 69 1.6%
5 Africa 25 0.6%

Code
print_questions(34)

c_tab = my.crosstab(ds, "year", "ever_diagnosed_mhd")
my.plot_crosstab_as_barplot(
    c_tab,
    title="Ever Been Diagnosed with MHD",
    xlabel="Survey",
    color=[orange, blue],
)
plt.gca().legend(frameon=False)
c_tab
Question:
ID Question
0 34 Have you ever been diagnosed with a mental health disorder?
ever_diagnosed_mhd Yes No
year
2016 716 717
2017 314 10
2018 189 2
2019 144 3

Code
print_questions(33)

c_tab = my.crosstab(ds, "year", "have_mhd")
my.plot_crosstab_as_barplot(
    c_tab,
    ylim=[None, 55],
    xlabel="Survey",
    title="Currently Have MHD",
    color=[orange, blue, green],
)
plt.gca().legend(frameon=False, ncol=3)
c_tab
Question:
ID Question
0 33 Do you currently have a mental health disorder?
have_mhd Yes No Don't know
year
2016 575 531 327
2017 324 222 210
2018 191 112 114
2019 147 104 101

Code
f_tbl = my.calc_counts_and_percentages("self_employed", ds, sort=False)
my.plot_counts_with_labels(
    f_tbl, x_lab="", rot=0, title="Self-Employed", color=[blue, orange]
)
f_tbl
self_employed n percent
0 Yes 650 15.5%
1 No 3550 84.5%

Code
f_tbl = my.calc_counts_and_percentages("company_size", ds, sort="index")
my.plot_counts_with_labels(
    f_tbl, x_lab="", rot=15, title="Company/Organization Size"
)
f_tbl
company_size n percent
0 A. 1-5 254 6.8%
1 B. 6-25 689 18.6%
2 C. 26-100 824 22.2%
3 D. 100-500 788 21.2%
4 E. 500-1000 247 6.7%
5 F. More than 1000 912 24.6%

Code
f_tbl = my.calc_counts_and_percentages("is_tech_org", ds, sort="index")
my.plot_counts_with_labels(
    f_tbl,
    x_lab="",
    rot=0,
    title="Is Mainly Tech Company/Organization?",
    color=[blue, orange],
)
f_tbl
is_tech_org n percent
0 Yes 2888 77.8%
1 No 826 22.2%

Code
f_tbl = my.calc_counts_and_percentages("is_tech_it_role", ds)
my.plot_counts_with_labels(
    f_tbl,
    x_lab="",
    rot=0,
    title="Works in Tech/IT Role",
    color=[blue, orange],
)
f_tbl
is_tech_it_role n percent
0 Yes 1471 93.6%
1 No 100 6.4%


3.3.1 Mental Health and Productivity

Approximately 81% of respondents (out of 453) think that their mental illness often affects their productivity. Furthermore, based on the answers of 2031 respondents, mental health issues interfere with productivity less when they are treated effectively.

This information was revealed by analyzing the answers to the following 3 questions:

Code
print_questions(54)
Question:
ID Question
0 54 Do you believe your productivity is ever affected by a mental health issue?
Code
f_tbl = my.calc_counts_and_percentages("affected_productivity", ds)
my.plot_counts_with_labels(
    f_tbl,
    x_lab="",
    rot=0,
    title="How Often Productivity Is Affected\nby Mental Illness",
    color=[orange, blue, blue],
)
f_tbl
affected_productivity n percent
0 Often 367 81.0%
1 Unsure 60 13.2%
2 Rarely 26 5.7%

Code
print_questions((48, 49))
Questions:
ID Question
0 48 If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?
1 49 If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?
Code
# -- Group 'Not applicable to me' was excluded
q_48_49 = pd.read_sql(
    """--sql
     WITH recoded_48_49 AS (
        SELECT
            UserID user_id,
            SurveyID year,
            CASE 
                WHEN AnswerText = 'Often' THEN "D. Often"
                WHEN AnswerText = 'Sometimes' THEN "C. Sometimes"
                WHEN AnswerText = 'Rarely' THEN "B. Rarely"
                WHEN AnswerText = 'Never' THEN "A. Never"
                WHEN AnswerText = 'Not applicable to me' THEN NULL
                ELSE AnswerText
            END AS productivity_interference,
            QuestionID
        FROM Answer 
        WHERE QuestionID IN (48, 49)
    )

    SELECT user_id, year, 'Treated' AS treatment, productivity_interference
    FROM recoded_48_49 
    WHERE QuestionID = 48

    UNION ALL

    SELECT user_id, year, 'Untreated' AS treatment, productivity_interference
    FROM recoded_48_49 
    WHERE QuestionID = 49;
    """,
    db_engine,
)
# N respondents
#     Non-relevant values were removed; the same people answered both
#     questions (about the situation with and without effective treatment).
n = q_48_49.dropna()["user_id"].nunique()

# Absolute counts
ct_abs = my.crosstab(q_48_49, "treatment", "productivity_interference")

# Plot
plt.rc("figure", figsize=(7, 3))
my.plot_crosstab_as_barplot(
    ct_abs,
    title="Mental Illness and Interference with Productivity",
    ylim=[None, 70],
    color=[green, blue, orange, red],
)
plt.legend(frameon=False, loc="upper left")

# Output
print(
    f"After removing non-relevant answers, n={n} respondents were included.\n"
    "Cross-tabulation:"
)
ct_abs["Total"] = ct_abs.sum(numeric_only=True, axis=1)
display(ct_abs)

print("Percentage of group (treated/untreated) totals:")
After removing non-relevant answers, n=2031 respondents were included.
Cross-tabulation:
productivity_interference A. Never B. Rarely C. Sometimes D. Often Total
treatment
Treated 165 700 808 166 1839
Untreated 24 113 672 1183 1992
Percentage of group (treated/untreated) totals:


4 Take Away Points

In this project, the Mental Health in the Tech Industry dataset, which covers the 2014-2019 surveys, was explored. The findings are most applicable to North America and Europe, especially the USA, as most of the respondents were from these regions.

  • The results show that effective MHD treatment may increase productivity at work.
  • In order to receive professional treatment, people first have to seek professional help.
    • The factors positively related to seeking MHD help:
      • mental health benefits as part of healthcare coverage,
      • increased anonymity to treat MHD,
      • work in a bigger company,
      • older age of the respondent, and
      • provided education on MHD.
    • The factors that were irrelevant (no relationship found):
      • working in a technical/non-technical role or company.
  • More effort should be put into improving the situation of the following groups, as they are less likely to seek MHD-related help:
    • males,
    • younger people, and
    • Europeans (compared to North Americans).
  • Employees are more prone to reveal MHD issues to managers than to coworkers, so managers have more influence on the ways to increase productivity.
  • It seems that the situation is improving over the years, but there is still much to do to reach an optimal point.

Suggestions to Improve the Analysis

  1. A broader analysis that includes more questions and more relationships between the answers could be performed, e.g.:
    • Factors such as race were not included but might be important.
    • Not only MHD in general but also its subtypes should be investigated.
  2. In some analyses, more emphasis could be put on cases where the sample size or response rate is small, to indicate that those results are less reliable.