The Analysis of Apple Podcasts Reviews

Data Analysis Project

Author

Vilmantas Gėgžna

Published

2023-01-04

Updated

2023-07-29

Project logo. Generated with Leonardo.Ai.

Tools: Python, SQL, R, Looker Studio
Helper tools: VSCode, Quarto, Git

Abbreviations

  • CI – 95% confidence interval.
  • CLD – compact letter display.
  • gof, GOF – goodness of fit.
  • n – either sample size or group size.
  • ns – not significant.
  • p – p-value.
  • p_adj – p-value (adjusted).
  • r – correlation coefficient (Spearman’s).
  • UTC – coordinated universal time.
  • wo – without.
  • π – pi (proportion).
  • χ² – chi-squared.

1 Introduction

A podcast is a series of spoken-word episodes, all focused on a particular topic or theme, like cycling or startups (source: The Podcast Host). It is a program made available in digital format either for download over the Internet or for listening/watching online (source: Wikipedia). Apple Podcasts is one of the prominent platforms dedicated to hosting podcasts. This project focuses on the analysis of podcasts hosted on this platform: their categories, reviews, and ratings.

1.1 Purpose

This analysis aims to investigate:

  1. the number of reviews and the ratings of podcasts, including:
    • their temporal trends and patterns,
    • the relationships between these two variables and other features (various podcast title, review title, and review body length metrics, as well as the number of categories/sub-categories per podcast);
  2. the popularity of podcast categories.

2 Methods

For statistical inference, significance level \(\alpha=0.05\) and 95% confidence level are used.

2.1 Population and Sample

It is assumed that the data constitute a simple random sample (or its equivalent), which allows the use of inferential statistics methods. In most cases, where various features of reviews are analyzed, the population can be defined as:

  • reviews of podcasts on the Apple Podcasts platform.

In cases where metrics aggregated for each podcast are used, the population can be defined as:

  • podcasts on the Apple Podcasts platform.

Lastly, in the analysis of categories, as the same podcast can have several categories, it was assumed that each category is a separate, independent observational unit (category labels as well as podcast information were treated as features of each category). In this case, the population can be defined as:

  • podcast categories on the Apple Podcasts platform.

2.2 Differences Between Groups

For statistical inference about several groups (categories), the following strategy is used:

  • first, 95% confidence intervals (CI) for each group are calculated,
  • second, an omnibus test is performed,
  • third, when the result of the omnibus statistical test is significant, a post-hoc analysis (pair-wise comparisons) is performed.

In the post-hoc analysis, \(p\) values are adjusted with the Holm correction to control the inflated probability of a type I error. Moreover, the results of pair-wise comparisons are compactly presented as a CLD¹ (compact letter display) representation of the adjusted \(p\) values.

¹ IMPORTANT: In CLD representation of the results, a letter shared by several groups (e.g., control and treatment) indicates that the results are statistically insignificant (\(p\ge\alpha\)), and lack of shared letter shows statistically significant results. The main advantage of CLD is that instead of \(\frac{k \cdot (k-1)}{2}\) \(p\) values in each-to-each group pair-wise comparisons we get \(k\) CLD representations of the same results, i.e., results are easier to understand if many groups are compared. Here \(k\) is the number of groups.
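To make the correction step concrete, below is a minimal sketch of the Holm adjustment with statsmodels (the p-values are hypothetical; in this project, the adjustment and the CLD conversion are handled by helper functions and R):

Code
# Holm adjustment of hypothetical pair-wise p-values
# (k = 4 groups give k * (k - 1) / 2 = 6 comparisons)
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.004, 0.020, 0.030, 0.200, 0.700]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for p, p_adj, rej in zip(p_values, p_adjusted, reject):
    print(f"p = {p:.3f} -> p_adj = {p_adj:.3f}, significant: {rej}")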

  1. In the analysis of differences between group/category counts
    (e.g., the analysis of the number of reviews per certain time period),
    the following methods are used:

    • plot: bar chart with absolute counts on the y-axis and percentage values as bar labels.
    • confidence intervals: Goodman’s simultaneous confidence intervals of proportions;
    • omnibus: Pearson’s chi-square (χ²) goodness-of-fit (GOF) test,
      • hypotheses:
        \(H_0:\) All proportions are equal: \(~ \pi_1 = \pi_2 = ... = \pi_i = ... = \pi_k\)
        \(H_1:\) At least two proportions differ: \(\pi_i \ne \pi_j\) for at least one pair of \(i\) and \(j\).
    • post-hoc: pair-wise chi-square χ² GOF test (with Holm correction),
      • hypotheses:
        \(H_0:\) Both proportions are equal: \(~ \pi_i = \pi_j\),
        \(H_1:\) Proportions differ: \(~ \pi_i \ne \pi_j\).

    Here \(\pi\) (pi) is a proportion (relative frequency, percentage) of values in a certain group,
    \(i\) and \(j\) are group indices (\(i\) = 1, 2, …, \(k\); \(j\) = 1, 2, …, \(k\); \(i \ne j\)),
    \(k\) – total number of groups.

  2. In the analysis of differences between continuous/numeric data groups
    (e.g., the analysis of ratings grouped by time periods),
    the following methods are used:

    • plot: dot/scatter plot where dots represent means and error bars indicate 95% CI. As the sample size is very large in most cases, the errors are small and the error bars may not be visible.
    • confidence intervals: t-distribution-based confidence intervals for mean;
    • omnibus: non-parametric Kruskal-Wallis test,
      • hypotheses:
        \(H_0:\) No stochastic dominance among the groups,
        \(H_1:\) At least one group stochastically dominates one other group.
    • post-hoc: non-parametric Conover-Iman test (with Holm correction),
      • hypotheses:
        \(H_0:\) No stochastic dominance between the two groups,
        \(H_1:\) One group stochastically dominates the other group.

    The term “stochastic dominance” means that one of the groups tends to have (statistically) larger values than some other group.
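To illustrate both procedures, here is a minimal self-contained sketch on synthetic data (the groups and values are hypothetical; the project's actual implementation lives in the `analyze_counts()` and `analyze_numeric_groups()` functions defined below):

Code
import numpy as np
import pandas as pd
import scipy.stats as sps
import scikit_posthocs as sp
from statsmodels.stats.proportion import multinomial_proportions_confint

rng = np.random.default_rng(1)

# Count data: Goodman's simultaneous CIs + chi-squared GOF omnibus test
counts = np.array([120, 95, 60])  # hypothetical counts in 3 groups
ci = multinomial_proportions_confint(counts, alpha=0.05, method="goodman")
chi2, p = sps.chisquare(counts)  # H0: all proportions are equal
print("Goodman 95% CIs:\n", ci)
print(f"Chi-squared GOF: chi2 = {chi2:.2f}, p = {p:.4f}")

# Numeric data: Kruskal-Wallis omnibus + Conover-Iman post-hoc (Holm)
data = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "value": np.concatenate([
        rng.normal(4.5, 0.5, 50),
        rng.normal(4.7, 0.5, 50),
        rng.normal(4.9, 0.5, 50),
    ]),
})
h, p = sps.kruskal(*[g["value"] for _, g in data.groupby("group")])
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")
print(sp.posthoc_conover(data, val_col="value", group_col="group",
                         p_adjust="holm"))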

2.3 Correlation Between Variables

For the relationship between numeric variables, non-parametric Spearman’s rank correlation analysis is used.

  • Hypotheses:
    \(H_0:\) \(\rho_s = 0\) (variables do not correlate),
    \(H_1:\) \(\rho_s \ne 0\) (variables correlate).

    Here \(\rho_s\) (rho) is the population Spearman’s correlation coefficient.

In the analysis of temporal patterns and trends, no p-value correction is applied. In the correlation analyses, where the number-of-reviews and rating variables are used several times, the Holm correction is applied to prevent an inflated type I error rate.
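A minimal sketch of such a correlation test on hypothetical vectors (the project's analyses use the `do_corr_1_vs_other()` helper defined below):

Code
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)  # weak hypothetical relationship

r, p = sps.spearmanr(x, y, nan_policy="omit")
print(f"Spearman's r = {r:.2f}, p = {p:.4f}")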

3 Preparation and Inspection

In this section, statistical software is prepared for the analysis, and data are imported into the statistical software, inspected, and pre-processed to be ready for a deeper statistical analysis.

3.1 Setup

Code
# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1

%aimport functions


# Packages and modules -------------------------------
import os
import warnings

# To connect SQL database
import sqlalchemy as sql

# Data wrangling, math
import numpy as np
import pandas as pd

# Statistical analysis
import scipy.stats as sps
import statsmodels.stats.api as sms
import pingouin as pg
import scikit_posthocs as sp

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Custom functions
import functions as my

# Enable ability to run R in Python code cells
os.environ["R_HOME"] = "C:/PROGRA~1/R/R-4.3.1"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    import rpy2

    %load_ext rpy2.ipython

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 40)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# colors
green, blue, orange, red = "tab:green", "tab:blue", "tab:orange", "tab:red"

In this section, ad-hoc (specialized for this analysis) functions are defined. The more generalizable functions can be found in functions.py and functions.R files of this project.

Code
# Main analysis functions


def analyze_counts(
    counts,
    by: str,
    xlabel=None,
    ylabel=None,
    counts_of: str = "review",
    **kwargs,
):
    """The main function to analyze count data

    - Performs omnibus chi-squared and post-hoc pair-wise chi-squared test.
    - Compactly presents results of the post-hoc test as CLD
      (compact letter display, where a shared CLD letter shows no significant
      difference between groups).
      NOTE: CLD computations require R.
    - Calculates percentages and their confidence intervals by using Goodman's
      method.
    - Creates summary of grouped values (group counts and percentages).
    - Plots results as bar plots with percentage labels.

    Args:
        counts (pandas.Series[int]): Count data to analyze.
        by (str): Grouping variable name. Used to create labels.
        xlabel (str, None, optional): X axis label.
                Defaults to None: autogenerated label.
        ylabel (str, None, optional): Y axis label.
                Defaults to None: autogenerated label.
        counts_of (str, optional): The thing that was counted. Used for labels.
                Defaults to "review".
        **kwargs: further arguments passed to `my.plot_counts_with_labels()`
    """
    # Omnibus test
    my.display_collapsible(
        my.test_chi_square_gof(counts), "Omnibus (chi-squared) test results"
    )

    # Post-hoc pairwise chi-square
    posthoc_p = my.pairwise_chisq_gof_test(counts)
    posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p, output_gr_var=by)

    # Confidence interval
    n_label = f"n_{counts_of}s"
    ci = (
        my.ci_proportion_multinomial(counts, method="goodman", n_label=n_label)
        .rename_axis(by)
        .reset_index()
    )

    # Make sure datasets are mergeable
    ci[by] = ci[by].astype(str)
    posthoc_cld[by] = posthoc_cld[by].astype(str)

    # Merge results
    n_per_category = pd.merge(ci, posthoc_cld, on=by)

    # Descriptive statistics: calculate
    to_format = ["min", "max", "range", "mean", "median", "std", "mad"]

    format_0f = lambda x: [f"{i:,.0f}" for i in x]

    summary_count = my.calc_summaries(ci[n_label])
    summary_count[to_format] = summary_count[to_format].apply(format_0f)

    summary_perc = my.calc_summaries(ci["percent"])
    summary_perc[to_format] = summary_perc[to_format].apply(my.format_percent)

    # Format percentages and counts
    vars = ["percent", "ci_lower", "ci_upper"]
    n_per_category[vars] = n_per_category[vars].apply(my.format_percent)

    my.display_collapsible(
        n_per_category.style.format({n_label: "{:,.0f}"}),
        f"{counts_of.capitalize()} counts with 95% CI and post-hoc "
        " (pairwise chi-squared) test results",
    )

    # Descriptive statistics: display
    my.display_collapsible(
        pd.concat([summary_count, summary_perc]),
        f"Descriptive statistics of group ({by}) counts",
    )

    # Plot
    if xlabel is None:
        xlabel = by.capitalize()

    if ylabel is None:
        ylabel = f"Number of {counts_of}s"

    my.plot_counts_with_labels(
        n_per_category, x_lab=xlabel, y_lab=ylabel, y=n_label, **kwargs
    )
    my.ax_axis_comma_format("y")

    # Output
    return n_per_category


def analyze_numeric_groups(
    y: str, by: str, data, title=None, xlabel=None, ylabel=None
):
    """The main function to analyze numeric/continuous data by groups

    - Calculates mean ratings per group and their confidence intervals using
      t distribution.
    - Performs omnibus (Kruskal-Wallis) and post-hoc (Conover-Iman) tests.
    - Compactly presents results of the post-hoc test as CLD
      (compact letter display, where a shared CLD letter shows no significant
      difference between groups).
      NOTE: CLD computations require R.
    - Creates a summary of the group means.
    - Plots results as points with 95% confidence interval error bars.

    Args:
        y (str): Name of numeric/continuous (dependent) variable.
        by (str): Name of grouping (independent) variable.
        data (pandas.DataFrame): data frame with variables indicated in
             `y` and `by`.
        xlabel (str, None, optional): X axis label.
                Defaults to None: capitalized value of `by`.
        ylabel (str, None, optional): Y axis label.
                Defaults to None: capitalized value of `y`.
        title (str, None, optional): The title of the plot.
                Defaults to None.
    """

    # Omnibus test: Kruskal-Wallis test
    stat_rez = pg.kruskal(data=data, dv=y, between=by)
    stat_rez["p-unc"] = my.format_p(stat_rez["p-unc"][0])

    my.display_collapsible(stat_rez, "Omnibus (Kruskal-Wallis) test results")

    # Confidence intervals
    ci_raw = data.groupby(by)[y].apply(
        lambda x: [np.mean(x), *sms.DescrStatsW(x).tconfint_mean()]
    )
    ci = pd.DataFrame(
        list(ci_raw),
        index=ci_raw.index,
        columns=["mean", "ci_lower", "ci_upper"],
    ).reset_index()

    # Post-hoc test: Conover-Iman test
    posthoc_p_matrix = sp.posthoc_conover(
        data, val_col=y, group_col=by, p_adjust="holm"
    )
    posthoc_p_df = my.as_df(
        posthoc_p_matrix.stack(), ["group1", "group2"], "p.adj"
    )
    posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p_df, output_gr_var=by)

    # Make sure datasets are mergeable
    ci[by] = ci[by].astype(str)
    posthoc_cld[by] = posthoc_cld[by].astype(str)

    ci_and_cld = pd.merge(posthoc_cld, ci, on=by)

    # Display CI and CLD
    my.display_collapsible(
        ci_and_cld,
        "Post-hoc (Conover-Iman) test results as CLD and "
        "Confidence intervals (CI) of ratings",
    )

    # Descriptive statistics of means
    my.display_collapsible(
        my.calc_summaries(ci["mean"]),
        f"Descriptive statistics of group ({by}) means",
    )

    # Plot
    fig, ax = plt.subplots()

    x = ci.iloc[:, 0]

    ax.errorbar(
        x=x,
        y=ci["mean"],
        yerr=[ci["mean"] - ci["ci_lower"], ci["ci_upper"] - ci["mean"]],
        mfc="red",
        ms=2,
        mew=1,
        fmt="ko",
        zorder=3,
    )

    ax.axhline(
        y=ci["mean"].mean(),
        color="lightgray",
        linestyle="--",
        label="Reference line (mean of means)",
        zorder=1,
    )

    if xlabel is None:
        xlabel = by.capitalize()
    if ylabel is None:
        ylabel = y.capitalize()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_ylim([4, 5])
    ax.set_title(title)

    ax.legend(frameon=False, loc="lower right")

    # Output
    return ci_and_cld


def analyze_ratings(
    by, title=None, xlabel=None, ylabel=None, y="rating", data=None
):
    """The main function to analyze ratings

    See the description of the function `analyze_numeric_groups()` for details.
    The environment in which the function is called must contain the variable
    (data frame) `data_reviews`.

    """
    if data is None:
        data = data_reviews
    return analyze_numeric_groups(
        y=y, by=by, data=data, title=title, xlabel=xlabel, ylabel=ylabel
    )


def plot_scatter(data_x, data_y, by, corr=None, counts_of: str = "review"):
    """Plot relationship between mean rating and number of reviews
       aggregated by group

       This is a convenience function to merge and plot the results of
       functions `analyze_counts()` and `analyze_ratings()`.
    Args:
        data_x (pandas.DataFrame): Data frame with required data.
        data_y (pandas.DataFrame): Data frame with required data.
        by (str): variable used for aggregation.
        corr (str, None, optional): if "Spearman", the results of
             Spearman's correlation analysis are printed.
        counts_of (str, optional): The thing that was counted.
                Used for creating variable name and labels.
                Defaults to "review".
    """
    assert corr in [
        "Spearman",
        None,
    ], f"Unknown value for argument `corr`: {corr}"

    x_var = f"n_{counts_of}s"

    data_merged = pd.merge(data_x, data_y, on=by)
    data_merged.plot.scatter(
        x=x_var,
        y="mean",
        figsize=(3.5, 3),
        rot=45,
        xlabel=f"Number of {counts_of}s",
        ylabel="Average rating",
        title=f"Relationship Between Metrics \n"
        f"Aggregated by {by.capitalize()}",
    )
    my.ax_axis_comma_format("x")

    r, p = sps.spearmanr(data_merged[x_var], data_merged["mean"])

    if corr == "Spearman":
        print(f"Spearman's correlation: r = {r:.2f}, {my.format_p(p)}")


def do_corr_1_vs_other(data, var_x, corr_result=None):
    """Perform Spearman correlation analysis in 1-vs-remaining style

    Perform Spearman correlation between the selected and all the remaining
    numeric variables in the dataset.

    Args:
        data (pandas.DataFrame): Data frame with variables to analyze.
        var_x (str): variable used as the first variable in correlation
                     analysis.
        corr_result (list or None): List to collect the results of analysis.
                    Defaults to None.

    Returns:
        list: `corr_result` with appended results
    """
    # Initiate list, if not provided
    if corr_result is None:
        corr_result = []

    # The second column (var_y) must be numeric
    columns_to_include = (
        data.select_dtypes("number").drop(var_x, axis=1).columns
    )
    # Do analysis and collect the results
    for var_y in columns_to_include:
        if var_x == var_y:
            continue
        r, p = sps.spearmanr(data[var_x], data[var_y], nan_policy="omit")
        corr_result.append(
            {"variable_1": var_x, "variable_2": var_y, "r": r, "p": p}
        )
    # Return results
    return corr_result

3.2 Database

The database is downloaded from Kaggle to the directory data and renamed to podcast-reviews-database-v27.sqlite. At the moment of the analysis, the newest version of the database was v27. It consists of 4 tables:

  • runs: facts on database updates.
  • podcasts: podcast names, URLs, and similar data.
  • categories: data on categories of podcasts.
  • reviews: reviews and ratings of podcasts.
Code
# Create a connection to the database.
db_engine = sql.create_engine(
    "sqlite:///data/podcast-reviews-database-v27.sqlite"
)
Code
# List tables
sql.inspect(db_engine).get_table_names()
['categories', 'podcasts', 'reviews', 'runs']

Entity relationship diagram (ERD) of this database:

Code

The following code was used to create the ERD:

eralchemy2 -i sqlite:///data/podcast-reviews-database-v27.sqlite \
           -o img/podcast-reviews-database-v27--erd.jpg

Entity relationship diagram of Apple Podcasts Review database on Kaggle.

As the metadata in this database is poor, no connections are drawn between the tables. The following sections will inspect, explore, and analyze the data in more detail.

3.3 Data Import and Inspection

This section deals with individual (not merged) data frames.

3.3.1 Import and Inspection: “runs”

The runs dataset contains information on when the database was last updated.

Code
data_runs = pd.read_sql("SELECT * FROM runs;", db_engine)
print(f'The database was last updated on: {data_runs["run_at"].max()}')
The database was last updated on: 2022-12-10 01:17:07

3.3.2 Import and Inspection: “podcasts”

The podcasts table contains data on 109,602 podcasts. There are 745 podcasts with a non-unique name. No other discrepancies were detected.

Code
data_podcasts = pd.read_sql("SELECT * FROM podcasts;", db_engine)
Code
data_podcasts.shape
(109602, 5)
Code
data_podcasts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109602 entries, 0 to 109601
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   podcast_id  109602 non-null  object
 1   itunes_id   109602 non-null  int64 
 2   slug        109602 non-null  object
 3   itunes_url  109602 non-null  object
 4   title       109602 non-null  object
dtypes: int64(1), object(4)
memory usage: 4.2+ MB
Code
data_podcasts.isna().sum()
podcast_id    0
itunes_id     0
slug          0
itunes_url    0
title         0
dtype: int64
Code
data_podcasts.head()
podcast_id itunes_id slug itunes_url title
0 a00018b54eb342567c94dacfb2a3e504 1313466221 scaling-global https://podcasts.apple.com/us/podcas... Scaling Global
1 a00043d34e734b09246d17dc5d56f63c 158973461 cornerstone-baptist-church-of-orlando https://podcasts.apple.com/us/podcas... Cornerstone Baptist Church of Orlando
2 a0004b1ef445af9dc84dad1e7821b1e3 139076942 mystery-dancing-in-the-dark https://podcasts.apple.com/us/podcas... Mystery: Dancing in the Dark
3 a00071f9aaae9ac725c3a586701abf4d 1332508972 kts-money-matters https://podcasts.apple.com/us/podcas... KTs Money Matters
4 a000a500f06555f81220c3eb641aded7 1544900779 word-on-the-street-w-dreak-swift https://podcasts.apple.com/us/podcas... Word on the Street w/ Dreak Swift
Code
unique_podcast_count = my.count_unique(data_podcasts)
unique_podcast_count
n_unique percent_unique
podcast_id 109602 100.0%
itunes_id 109602 100.0%
slug 108503 99.0%
itunes_url 109602 100.0%
title 108857 99.3%
Code
# Count podcasts with non-unique name
n_unique_itunes_id = unique_podcast_count.loc["itunes_id", "n_unique"]
n_unique_title = unique_podcast_count.loc["title", "n_unique"]

n_unique_itunes_id - n_unique_title
745

3.3.3 Import and Inspection: “categories”

The categories dataset contains data on podcast categories. There are 211,760 entries for categories (one row per podcast-category combination) and 109,602 unique podcasts (the same number as in the previous dataset).

Before importing into Python, some similar categories were merged to create a new variable category. The old variable with categories was renamed to sub_category (i.e., these categories are now treated as sub-categories). As a result, there are 19 categories and 110 sub-categories in the dataset.

Code
data_categories = pd.read_sql_query(
    """--sql
    SELECT 
        c.podcast_id,
        CASE
            WHEN c.category LIKE 'arts%' THEN 'arts'
            WHEN c.category LIKE 'business%' THEN 'business'
            WHEN c.category LIKE 'comedy%' THEN 'comedy'
            WHEN c.category LIKE 'true-crime%' THEN 'crime'
            WHEN c.category LIKE 'education%' THEN 'education'
            WHEN c.category LIKE 'fiction%' THEN 'fiction'
            WHEN c.category LIKE 'tv-film%' THEN 'tv/film'
            WHEN c.category LIKE 'government%' THEN 'government'
            WHEN c.category LIKE 'health-fitness%' THEN 'health/fitness'
            WHEN c.category LIKE 'history%' THEN 'history'
            WHEN c.category LIKE 'kids-family%' THEN 'kids/family'
            WHEN c.category LIKE 'leisure%' THEN 'leisure'
            WHEN c.category LIKE 'music%' THEN 'music'
            WHEN c.category LIKE 'news%' THEN 'news'
            WHEN c.category LIKE 'science%' THEN 'science'
            WHEN c.category LIKE 'society-culture%' THEN 'society/culture'
            WHEN c.category LIKE 'sports%' THEN 'sports'
            WHEN c.category LIKE 'technology%' THEN 'technology'
            WHEN c.category LIKE 'religion%' OR
                 c.category LIKE 'spirituality%' OR
                 c.category LIKE 'buddhism%' OR
                 c.category LIKE 'christianity%' OR
                 c.category LIKE 'hinduism%' OR
                 c.category LIKE 'islam%' OR
                 c.category LIKE 'judaism%' THEN 'religion/spirituality' 
            END AS category,
        c.category as sub_category
    FROM categories AS c;
    """,
    db_engine,
)
Code
data_categories.shape
(211760, 3)
Code
data_categories.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211760 entries, 0 to 211759
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   podcast_id    211760 non-null  object
 1   category      211760 non-null  object
 2   sub_category  211760 non-null  object
dtypes: object(3)
memory usage: 4.8+ MB
Code
data_categories.isna().sum()
podcast_id      0
category        0
sub_category    0
dtype: int64
Code
data_categories.head()
podcast_id category sub_category
0 c61aa81c9b929a66f0c1db6cbe5d8548 arts arts
1 c61aa81c9b929a66f0c1db6cbe5d8548 arts arts-performing-arts
2 c61aa81c9b929a66f0c1db6cbe5d8548 music music
3 ad4f2bf69c72b8db75978423c25f379e arts arts
4 ad4f2bf69c72b8db75978423c25f379e arts arts-design
Code
my.count_unique(data_categories)
n_unique percent_unique
podcast_id 109602 51.8%
category 19 <0.1%
sub_category 110 0.1%

3.3.4 Import and Inspection: “reviews”

There are 2,025,333 entries in the dataset, and 652 of them are identified as duplicates. The number of unique review authors (n = 1,451,063) is smaller than the number of reviews (the ratio of unique authors to reviews is ~0.72), which means that, on average, each author wrote about 1.4 reviews. The number of unique podcasts is 111,125, which is higher than in the previous datasets. This should be investigated in more detail (see the section on several data frames).

All review timestamps are in the UTC-07:00 time zone. Some dates were in an unusual format/script (e.g., ٢٠٢٠-١٢-٢٤T١١:٥٥:٠٨-07:00, written with Arabic-Indic digits), but Python functions manage to automatically standardize even these dates.

No other discrepancies were found.

Code
data_reviews = pd.read_sql("SELECT * FROM reviews;", db_engine)
Code
data_reviews.shape
(2025333, 6)
Code
data_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025333 entries, 0 to 2025332
Data columns (total 6 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   podcast_id  object
 1   title       object
 2   content     object
 3   rating      int64 
 4   author_id   object
 5   created_at  object
dtypes: int64(1), object(5)
memory usage: 92.7+ MB
Code
data_reviews.isna().sum()
podcast_id    0
title         0
content       0
rating        0
author_id     0
created_at    0
dtype: int64
Code
data_reviews.head(2)
podcast_id title content rating author_id created_at
0 c61aa81c9b929a66f0c1db6cbe5d8548 really interesting! Thanks for providing these insights.... 5 F7E5A318989779D 2018-04-24T12:05:16-07:00
1 c61aa81c9b929a66f0c1db6cbe5d8548 Must listen for anyone interested in... Super excited to see this podcast gr... 5 F6BF5472689BD12 2018-05-09T18:14:32-07:00
Code
my.count_unique(data_reviews)
n_unique percent_unique
podcast_id 111125 5.5%
title 1117416 55.2%
content 2008001 99.1%
rating 5 <0.1%
author_id 1451063 71.6%
created_at 2012360 99.4%
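Given the authors-to-reviews ratio above, most authors should have written a single review. A quick check of how many authors wrote 1, 2, 3, … reviews (an illustrative sketch):

Code
# Distribution of the number of reviews per author
# (how many authors wrote 1, 2, 3, ... reviews)
data_reviews["author_id"].value_counts().value_counts().sort_index().head()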

In the reviews table, there are duplicated records. They will be removed in the pre-processing step.

Code
review_duplicates = data_reviews.duplicated()
print(f"There are review duplicates: {review_duplicates.any()}")
print(f"Number of review duplicates: {review_duplicates.sum()}")
There are review duplicates: True
Number of review duplicates: 652

The time zone of all review timestamps is -07:00:

Code
tz = data_reviews["created_at"].str.extract(r"(-\d{2}:\d{2})$")
tz.value_counts()
-07:00    2025333
Name: count, dtype: int64

Some dates are in a strange format (see the largest values):

Code
data_reviews["created_at"].agg([min, max])
min    2005-12-09T21:41:01-07:00
max    ٢٠٢٠-١٢-٢٤T١١:٥٥:٠٨-07:00
Name: created_at, dtype: object
Code
sorted_dates = data_reviews["created_at"].sort_values()

print("\nLargest timestamps of reviews: \n")
print(sorted_dates.tail())

Largest timestamps of reviews: 

2025300    2022-12-08T07:56:21-07:00
2025302    2022-12-08T11:14:54-07:00
2025326    2022-12-08T12:12:47-07:00
1136903    ٢٠٢٠-١٢-٢٤T٠٨:٢٦:٣٦-07:00
1136904    ٢٠٢٠-١٢-٢٤T١١:٥٥:٠٨-07:00
Name: created_at, dtype: object

It seems that conversion to Pandas datetime format solves this issue.

Code
print("The same timestamps after the conversion to datetime format: ")
pd.to_datetime(sorted_dates.tail(), format='mixed')
The same timestamps after the conversion to datetime format: 
2025300   2022-12-08 07:56:21-07:00
2025302   2022-12-08 11:14:54-07:00
2025326   2022-12-08 12:12:47-07:00
1136903   2020-12-24 08:26:36-07:00
1136904   2020-12-24 11:55:08-07:00
Name: created_at, dtype: datetime64[ns, UTC-07:00]

3.4 Pre-Processing of Individual Data Frames

This section deals with individual (not merged) data frames. Data frame “runs” was not included in the further analysis so there is no dedicated sub-subsection for it.

3.4.1 Pre-Processing: “podcasts”

Information on slug and podcast title length metrics were extracted as additional features:

  • For slug:
    1. number of characters,
    2. number of hyphens,
    3. number of characters without hyphens,
    4. average word length.
  • For podcast title:
    1. number of characters,
    2. number of words,
    3. average word length.
Code
# Here, a word is a substring that contains digits, letters, underscores, and
# hyphens (when the hyphens are in the middle of the word)
word_pattern = r"\w+((-\w)*\w)*"

data_podcasts = data_podcasts.rename({"title": "podcast_title"}, axis=1)

# Info on slug
data_podcasts["slug_n_char"] = data_podcasts["slug"].apply(len)
data_podcasts["slug_n_hyphen"] = data_podcasts["slug"].str.count("-")
data_podcasts["slug_n_char_wo_hyphen"] = data_podcasts.eval(
    "slug_n_char - slug_n_hyphen"
)
data_podcasts["slug_avg_word_length"] = data_podcasts.eval(
    "slug_n_char_wo_hyphen / (slug_n_hyphen + 1)"
)

# Info on podcast title
data_podcasts["podcast_title_n_char"] = data_podcasts["podcast_title"].apply(
    len
)
data_podcasts["podcast_title_n_words"] = data_podcasts[
    "podcast_title"
].str.count(word_pattern)
data_podcasts["podcast_title_avg_word_length"] = data_podcasts.eval(
    "podcast_title_n_char / podcast_title_n_words"
)
Code
data_podcasts.head().loc[:, "slug_n_char":"podcast_title_avg_word_length"]
slug_n_char slug_n_hyphen slug_n_char_wo_hyphen slug_avg_word_length podcast_title_n_char podcast_title_n_words podcast_title_avg_word_length
0 14 1 13 6.50 14 2 7.00
1 37 4 33 6.60 37 5 7.40
2 27 4 23 4.60 28 5 5.60
3 17 2 15 5.00 17 3 5.67
4 32 6 26 3.71 33 7 4.71
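As a sanity check of word_pattern, a hyphenated word counts as a single word (the example strings below are for illustration only):

Code
import re

for text in ["KTs Money Matters", "check-in times"]:
    n_words = sum(1 for _ in re.finditer(word_pattern, text))
    print(f"{text!r} -> {n_words} word(s)")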

3.4.2 Pre-Processing: “categories”

  1. The step of merging categories was performed before importing this dataset into Python (see the section on “categories” data import and inspection).
  2. Now one additional dataset for category counts will be created: this dataset will contain a single row per podcast.
Code
n_categories_per_podcast = pd.concat(
    [
        # n categories per podcast
        data_categories.drop_duplicates(subset=["podcast_id", "category"])[
            "podcast_id"
        ]
        .value_counts()
        .rename("n_categories"),
        # n sub-categories per podcast
        data_categories["podcast_id"].value_counts().rename("n_subcategories"),
    ],
    axis=1,
).rename_axis("podcast_id")

# Preview
n_categories_per_podcast.head(2).reset_index()
podcast_id n_categories n_subcategories
0 c7c3acaf63bd2e97631c2701c135ae38 5 7
1 c5c2b1d7f6f0fc30f0e120a42ac7c931 5 5

3.4.3 Pre-Processing: “reviews”

The main steps in “reviews” pre-processing:

  1. Duplicated entries are removed.
  2. Timestamps are converted from string to Python date-time format.
  3. Year, month, weekday name, and hour of the day, when a review was created, are extracted as separate variables to investigate temporal trends and patterns.
  4. Review title and review body length metrics are calculated: the number of characters, the number of words, and the average word length, as well as the ratios of review body to review title length metrics.
Code
# Remove duplicates
data_reviews = data_reviews.drop_duplicates()

# Convert to datetime format
data_reviews["created_at"] = (
    pd.to_datetime(data_reviews["created_at"], format='mixed')
)

# Extract information from dates
data_reviews["year"] = data_reviews["created_at"].dt.year
data_reviews["month"] = data_reviews["created_at"].dt.month
data_reviews["weekday"] = pd.Categorical(
    data_reviews["created_at"].dt.day_name(),
    categories=[
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday",
    ],
    ordered=True,
)
data_reviews["hour"] = data_reviews["created_at"].dt.hour
Code
# To remove ambiguity in merged dataset
data_reviews = data_reviews.rename({"title": "review_title"}, axis=1)
Code
# Metrics of review title length
data_reviews["review_title_n_char"] = data_reviews["review_title"].apply(len)

data_reviews["review_title_n_words"] = data_reviews["review_title"].str.count(
    word_pattern
)
data_reviews["review_title_avg_word_length"] = data_reviews.eval(
    "review_title_n_char / review_title_n_words"
).replace([np.inf, np.nan], 0)

# Metrics of review body length
data_reviews["review_body_n_char"] = data_reviews["content"].apply(len)

data_reviews["review_body_n_words"] = data_reviews["content"].str.count(
    word_pattern
)
data_reviews["review_body_avg_word_length"] = data_reviews.eval(
    "review_body_n_char / review_body_n_words"
).replace([np.inf, np.nan], 0)
Code
# Ratios (content to title)
data_reviews["review_body_title_n_char_ratio"] = data_reviews.eval(
    "review_body_n_char / review_title_n_char"
).replace([np.inf, np.nan], 0)

data_reviews["review_body_title_n_words_ratio"] = data_reviews.eval(
    "review_body_n_words / review_title_n_words"
).replace([np.inf, np.nan], 0)

data_reviews["review_body_title_avg_word_length_ratio"] = data_reviews.eval(
    "review_body_avg_word_length / review_title_avg_word_length"
).replace([np.inf, np.nan], 0)
Code
# Check if duplicates are removed
print(f"n duplicates = {sum(data_reviews.duplicated())}")
n duplicates = 0
Code
my.count_unique(data_reviews)
n_unique percent_unique
podcast_id 111125 5.5%
review_title 1117416 55.2%
content 2008001 99.2%
rating 5 <0.1%
author_id 1451063 71.7%
created_at 2012360 99.4%
year 18 <0.1%
month 12 <0.1%
weekday 7 <0.1%
hour 24 <0.1%
review_title_n_char 101 <0.1%
review_title_n_words 27 <0.1%
review_title_avg_word_length 767 <0.1%
review_body_n_char 3595 0.2%
review_body_n_words 888 <0.1%
review_body_avg_word_length 37487 1.9%
review_body_title_n_char_ratio 53313 2.6%
review_body_title_n_words_ratio 4725 0.2%
review_body_title_avg_word_length_ratio 297529 14.7%
Code
# Preview
data_reviews.head(2)
podcast_id review_title content rating author_id created_at year month weekday hour review_title_n_char review_title_n_words review_title_avg_word_length review_body_n_char review_body_n_words review_body_avg_word_length review_body_title_n_char_ratio review_body_title_n_words_ratio review_body_title_avg_word_length_ratio
0 c61aa81c9b929a66f0c1db6cbe5d8548 really interesting! Thanks for providing these insights.... 5 F7E5A318989779D 2018-04-24 12:05:16-07:00 2018 4 Tuesday 12 19 2 9.50 100 15 6.67 5.26 7.50 0.70
1 c61aa81c9b929a66f0c1db6cbe5d8548 Must listen for anyone interested in... Super excited to see this podcast gr... 5 F6BF5472689BD12 2018-05-09 18:14:32-07:00 2018 5 Wednesday 18 48 8 6.00 178 29 6.14 3.71 3.62 1.02

3.5 Several Data Frames

3.5.1 Inspection: Several Data Frames

The analysis reveals that the “reviews” dataset contains information on 1,523 more unique podcasts than the remaining tables. These additional podcasts will be excluded from the analyses in which information from more than one table is needed. In “reviews”-only analyses, these data will be included.

Code
print("Number of podcasts in each dataset:")
print(
    data_reviews.podcast_id.nunique(),
    data_podcasts.podcast_id.nunique(),
    data_categories.podcast_id.nunique(),
    sep=", ",
)

id_in_r = data_reviews.podcast_id.unique()
id_in_p = data_podcasts.podcast_id.unique()
id_in_c = data_categories.podcast_id.unique()

print(
    f"""
Podcast IDs:
- Only in "reviews", not in "podcasts"    {len(set(id_in_r).difference(id_in_p))}
- Only in "reviews", not in "categories"  {len(set(id_in_r).difference(id_in_c))}
- Only in "podcasts", not in "reviews"    {len(set(id_in_p).difference(id_in_r))}
- Only in "podcasts", not in "categories" {len(set(id_in_p).difference(id_in_c))}
- Only in "categories", not in "podcasts" {len(set(id_in_c).difference(id_in_p))}
- Only in "categories", not in "reviews"  {len(set(id_in_c).difference(id_in_r))}
"""
)
Number of podcasts in each dataset:
111125, 109602, 109602

Podcast IDs:
- Only in "reviews", not in "podcasts"    1523
- Only in "reviews", not in "categories"  1523
- Only in "podcasts", not in "reviews"    0
- Only in "podcasts", not in "categories" 0
- Only in "categories", not in "podcasts" 0
- Only in "categories", not in "reviews"  0

3.5.2 Pre-Processing: Several Data Frames

In this section, pre-processing is performed keeping in mind 3 purposes:

  1. to explore the correlation between the number of reviews and other features: this metric is calculated for each podcast, so the correlation will be performed on feature values aggregated per podcast;
  2. to explore the correlation between rating and other features: as this metric exists for each review (i.e., one podcast can have many reviews), no aggregation is introduced here;
  3. to explore ratings across categories.

Pre-processing to analyze the number of reviews:

Code
gr_reviews = data_reviews.groupby("podcast_id")

# Aggregates of rating
rating_summary = gr_reviews["rating"].agg(["count", "mean"])
rating_summary = rating_summary.set_axis(["n_reviews", "mean_rating"], axis=1)

# Aggregates of other metrics
cols_other_metrics = [
    "review_title_n_char",
    "review_title_n_words",
    "review_title_avg_word_length",
    "review_body_n_char",
    "review_body_n_words",
    "review_body_avg_word_length",
    "review_body_title_n_char_ratio",
    "review_body_title_n_words_ratio",
    "review_body_title_avg_word_length_ratio",
]

metrics_per_podcast = gr_reviews[cols_other_metrics].agg(["mean"])
# Create meaningful variable names:
new_index = metrics_per_podcast.columns.to_flat_index().map(
    lambda x: f"{x[1]}_{x[0]}"
)

# Merge the datasets
data_for_n_reviews_corr = my.merge_all(
    [
        rating_summary,
        metrics_per_podcast.set_axis(new_index, axis=1),
        n_categories_per_podcast.reset_index(),
        # Select ID and numeric columns from "podcasts"
        data_podcasts.set_index("podcast_id")
        .select_dtypes("number")
        .drop("itunes_id", axis=1)
        .reset_index(),
    ],
    on="podcast_id",
    how="inner",
)

# Preview
print(data_for_n_reviews_corr.shape)
data_for_n_reviews_corr.head(2)
(109602, 21)
podcast_id n_reviews mean_rating mean_review_title_n_char mean_review_title_n_words mean_review_title_avg_word_length mean_review_body_n_char mean_review_body_n_words mean_review_body_avg_word_length mean_review_body_title_n_char_ratio mean_review_body_title_n_words_ratio mean_review_body_title_avg_word_length_ratio n_categories n_subcategories slug_n_char slug_n_hyphen slug_n_char_wo_hyphen slug_avg_word_length podcast_title_n_char podcast_title_n_words podcast_title_avg_word_length
0 a00018b54eb342567c94dacfb2a3e504 1 5.00 16.00 2.00 8.00 26.00 4.00 6.50 1.62 2.00 0.81 1 1 14 1 13 6.50 14 2 7.00
1 a00043d34e734b09246d17dc5d56f63c 1 5.00 12.00 2.00 6.00 144.00 26.00 5.54 12.00 13.00 0.92 1 2 37 4 33 6.60 37 5 7.40

Pre-processing to analyze ratings:

  • All data:
Code
data_all = my.merge_all(
    [data_podcasts, n_categories_per_podcast.reset_index(), data_reviews],
    on="podcast_id",
    how="inner",
)
data_all.shape
data_all.head(2)
podcast_id itunes_id slug itunes_url podcast_title slug_n_char slug_n_hyphen slug_n_char_wo_hyphen slug_avg_word_length podcast_title_n_char podcast_title_n_words podcast_title_avg_word_length n_categories n_subcategories review_title content rating author_id created_at year month weekday hour review_title_n_char review_title_n_words review_title_avg_word_length review_body_n_char review_body_n_words review_body_avg_word_length review_body_title_n_char_ratio review_body_title_n_words_ratio review_body_title_avg_word_length_ratio
0 a00018b54eb342567c94dacfb2a3e504 1313466221 scaling-global https://podcasts.apple.com/us/podcas... Scaling Global 14 1 13 6.50 14 2 7.00 1 1 Very informative Great variety of speakers! 5 CC47C85896D423B 2017-11-29 12:16:43-07:00 2017 11 Wednesday 12 16 2 8.00 26 4 6.50 1.62 2.00 0.81
1 a00043d34e734b09246d17dc5d56f63c 158973461 cornerstone-baptist-church-of-orlando https://podcasts.apple.com/us/podcas... Cornerstone Baptist Church of Orlando 37 4 33 6.60 37 5 7.40 1 2 Good Sernons I'm a regular listener. I only wish... 5 103CC9DA2046218 2019-10-08 04:23:32-07:00 2019 10 Tuesday 4 12 2 6.00 144 26 5.54 12.00 13.00 0.92
  • Numeric variables only:
Code
data_for_rating_corr = pd.concat(
    [
        data_all.loc[:, "rating"],
        data_all.loc[:, "slug_n_char":"n_subcategories"],
        data_all.loc[
            :, "review_title_n_char":"review_body_title_avg_word_length_ratio"
        ],
    ],
    axis=1,
)

# Preview
print(data_for_rating_corr.shape)
data_for_rating_corr.head(2)
(2000293, 19)
rating slug_n_char slug_n_hyphen slug_n_char_wo_hyphen slug_avg_word_length podcast_title_n_char podcast_title_n_words podcast_title_avg_word_length n_categories n_subcategories review_title_n_char review_title_n_words review_title_avg_word_length review_body_n_char review_body_n_words review_body_avg_word_length review_body_title_n_char_ratio review_body_title_n_words_ratio review_body_title_avg_word_length_ratio
0 5 14 1 13 6.50 14 2 7.00 1 1 16 2 8.00 26 4 6.50 1.62 2.00 0.81
1 5 37 4 33 6.60 37 5 7.40 1 2 12 2 6.00 144 26 5.54 12.00 13.00 0.92

Pre-processing to analyze categories:

Code
data_categories_reviews_ratings = my.merge_all(
    [
        data_categories[["podcast_id", "category"]].drop_duplicates(),
        data_for_n_reviews_corr[["podcast_id", "n_reviews", "mean_rating"]],
    ],
    on="podcast_id",
    how="inner",
)

# Sort categories by rating in plots and tables
cats_sorted = (
    data_categories_reviews_ratings.groupby("category")["mean_rating"]
    .mean()
    .sort_values(ascending=False)
    .index
)
data_categories_reviews_ratings["category"] = pd.Categorical(
    data_categories_reviews_ratings["category"], cats_sorted
)

# Preview
print(data_categories_reviews_ratings.shape)
data_categories_reviews_ratings.head(2)
(140421, 4)
podcast_id category n_reviews mean_rating
0 c61aa81c9b929a66f0c1db6cbe5d8548 arts 2 5.00
1 c61aa81c9b929a66f0c1db6cbe5d8548 music 2 5.00

4 Analysis

4.1 Overall Rating

Questions:

  1. What is the true mean of the overall rating?
  2. Is there a dominant rating value?

Population: Reviews of podcasts on the Apple Podcasts platform.

Code
ratings = data_reviews["rating"]

print("Rating:")
my.display_collapsible(
    my.calc_summaries(ratings), "Descriptive statistics (overall rating)"
)
my.display_collapsible(
    [round(i, 3) for i in sms.DescrStatsW(ratings).tconfint_mean()],
    "Confidence interval of mean (overall rating)",
    sep=" – ",
)

print("Analysis of counts per rating:")
counts = ratings.value_counts().sort_index()
analyze_counts(
    counts,
    by="rating",
    title="Overall Rating",
    rot=0,
);
Rating:
Analysis of counts per rating:
Descriptive statistics (overall rating)
count min max range mean median std mad skew
rating 2024681 1.00 5.00 4.00 4.63 5.00 1.04 0.00 -2.78
Confidence interval of mean (overall rating)4.629 – 4.631
Omnibus (chi-squared) test resultsChi square test, χ²(4, n = 2024681) = 5633146.82, p < 0.001
Review counts with 95% CI and post-hoc (pairwise chi-squared) test results
  rating n_reviews percent ci_lower ci_upper cld spaced_cld
0 1 113,197 5.6% 5.5% 5.6% a a____
1 2 44,517 2.2% 2.2% 2.2% b _b___
2 3 50,670 2.5% 2.5% 2.5% c __c__
3 4 61,360 3.0% 3.0% 3.1% d ___d_
4 5 1,754,937 86.7% 86.6% 86.7% e ____e
Descriptive statistics of group (rating) counts
count min max range mean median std mad skew
n_reviews 5 44,517 1,754,937 1,710,420 404,936 61,360 755,160 16,843 2.23
percent 5 2.2% 86.7% 84.5% 20.0% 3.0% 37.3% 0.8% 2.23

Results. Rating scores range from 1 (worst) to 5 (best) points. The distribution of rating values is highly skewed, with a mean overall rating of 4.630 (CI 4.629–4.631). The median rating is 5 points, and the dominant rating (which appeared in 86.7% of reviews) is also 5 points. All the remaining rating values are much less frequent (the differences in proportions between all five rating groups are significant).

4.2 Podcast Categories

Question: Which number of categories per podcast is most common?
Population: Podcasts on the Apple Podcasts platform.

Code
n_podcasts_per_n_main_categories = (
    n_categories_per_podcast["n_categories"].value_counts().sort_index()
)
analyze_counts(
    n_podcasts_per_n_main_categories,
    "n_categories",
    counts_of="podcast",
    xlabel="Number of categories per podcast",
    title="Popularity of Categories' Count per Podcast",
    rot=0,
);
Omnibus (chi-squared) test resultsChi square test, χ²(4, n = 109602) = 219804.47, p < 0.001
Podcast counts with 95% CI and post-hoc (pairwise chi-squared) test results
  n_categories n_podcasts percent ci_lower ci_upper cld spaced_cld
0 1 80,188 73.2% 72.8% 73.5% a a____
1 2 28,090 25.6% 25.3% 26.0% b _b___
2 3 1,246 1.1% 1.1% 1.2% c __c__
3 4 75 0.1% 0.1% 0.1% d ___d_
4 5 3 <0.1% <0.1% <0.1% e ____e
Descriptive statistics of group (n_categories) counts
count min max range mean median std mad skew
n_podcasts 5 3 80,188 80,185 21,920 1,246 34,707 1,243 1.68
percent 5 <0.1% 73.2% 73.2% 20.0% 1.1% 31.7% 1.1% 1.68

Results. Podcasts have up to 5 main categories. Differences in proportions between all the groups (of the number of categories per podcast) are statistically significant. Almost three-quarters of podcasts (n = 80,188; 73.2%, CI 72.8%–73.5%) have only one category, which is the most common choice, and slightly more than one-quarter (n = 28,090; 25.6%, CI 25.3%–26.0%) have 2 categories.


Question: Which number of categories per podcast is best rated?
Population: Podcasts on the Apple Podcasts platform.

Code
analyze_numeric_groups(
    "mean_rating",
    by="n_categories",
    data=data_for_n_reviews_corr,
    xlabel="Number of categories per podcast",
    ylabel="Mean rating",
    title="Rating of Categories' Count per Podcast",
);
Omnibus (Kruskal-Wallis) test results
Source ddof1 H p-unc
Kruskal n_categories 4 309.54 p < 0.001
Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI) of ratings
n_categories cld spaced_cld mean ci_lower ci_upper
0 1 a a__ 4.80 4.80 4.81
1 2 b _b_ 4.83 4.82 4.83
2 3 c __c 4.84 4.82 4.86
3 4 bc _bc 4.73 4.58 4.87
4 5 abc abc 4.87 4.47 5.27
Descriptive statistics of group (n_categories) means
count min max range mean median std mad skew
mean 5 4.73 4.87 0.14 4.81 4.83 0.05 0.02 -1.21

Note. Even though for 5 categories per podcast the upper bound of the CI is above 5 (the CI calculation method is unaware of the valid range of rating values), the plot's y-axis extends only up to 5.
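For intuition: a t-based CI computed from a tiny sample near the top of the scale can exceed the scale maximum (an illustrative sketch with hypothetical values):

Code
# Three hypothetical mean ratings near the top of the 1-5 scale
tiny_sample = [5.0, 5.0, 4.6]
print(sms.DescrStatsW(tiny_sample).tconfint_mean())  # upper bound exceeds 5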

Results. The change from 1 to 3 categories per podcast shows a small increasing trend (from a rating of 4.80 to 4.84). For 4 and 5 categories per podcast, the errors are large due to the small sample sizes, and no trends are visible there.


Question: Which number of sub-categories per podcast is the most common?
Population: Podcasts on the Apple Podcasts platform.

Code
n_podcasts_per_n_sub_categories = (
    n_categories_per_podcast["n_subcategories"].value_counts().sort_index()
)
analyze_counts(
    n_podcasts_per_n_sub_categories,
    "n_subcategories",
    counts_of="podcast",
    xlabel="Number of sub-categories per podcast",
    ylabel="Number of podcasts",
    title="Popularity of Sub-Categories' Count per Podcast",
    rot=0,
);
Omnibus (chi-squared) test resultsChi square test, χ²(7, n = 109602) = 171282.29, p < 0.001
Podcast counts with 95% CI and post-hoc (pairwise chi-squared) test results
  n_subcategories n_podcasts percent ci_lower ci_upper cld spaced_cld
0 1 44,145 40.3% 39.9% 40.7% a a_______
1 2 39,373 35.9% 35.5% 36.3% b _b______
2 3 16,473 15.0% 14.7% 15.3% c __c_____
3 4 8,796 8.0% 7.8% 8.3% d ___d____
4 5 648 0.6% 0.5% 0.7% e ____e___
5 6 146 0.1% 0.1% 0.2% f _____f__
6 7 18 <0.1% <0.1% <0.1% g ______g_
7 8 3 <0.1% <0.1% <0.1% h _______h
Descriptive statistics of group (n_subcategories) counts
count min max range mean median std mad skew
n_podcasts 8 3 44,145 44,142 13,700 4,722 18,309 4,712 1.09
percent 8 <0.1% 40.3% 40.3% 12.5% 4.3% 16.7% 4.3% 1.09

Results. Differences between all groups are significant. Most common are podcasts with a single subcategory (40.3%), but two-subcategory podcasts are almost equally common (35.9%).

We can notice that single-category podcasts in the sample have up to 4 sub-categories:

Code
categories_crosstab = pd.crosstab(
    n_categories_per_podcast.n_subcategories,
    n_categories_per_podcast.n_categories,
)

categories_crosstab
n_categories 1 2 3 4 5
n_subcategories
1 44145 0 0 0 0
2 30907 8466 0 0 0
3 5085 11115 273 0 0
4 51 8316 415 14 0
5 0 188 436 23 1
6 0 5 113 28 0
7 0 0 7 10 1
8 0 0 2 0 1



Question: Which categories are the most popular among Apple Podcasts’ creators?
Population: Podcast categories on the Apple Podcasts platform.

Code
n_podcasts_per_category = data_categories["category"].value_counts()

res_counts_category = analyze_counts(
    n_podcasts_per_category,
    by="category",
    counts_of="podcast",
    ylabel="Number of podcasts *",
    title="Popularity of Categories",
    rot=90,
    label_rotation=90,
    y_lim_max=34900,
)

plt.figtext(
    0.2,
    -0.5,
    "* The same podcast can have several categories",
    wrap=True,
    horizontalalignment="center",
    fontsize=8,
);
Omnibus (chi-squared) test resultsChi square test, χ²(18, n = 211760) = 109953.36, p < 0.001
Podcast counts with 95% CI and post-hoc (pairwise chi-squared) test results
  category n_podcasts percent ci_lower ci_upper cld spaced_cld
0 society/culture 27,536 13.0% 12.8% 13.2% a a_______________
1 religion/spirituality 23,409 11.1% 10.9% 11.3% b _b______________
2 business 22,390 10.6% 10.4% 10.8% c __c_____________
3 education 20,700 9.8% 9.6% 10.0% d ___d____________
4 arts 16,099 7.6% 7.4% 7.8% e ____e___________
5 comedy 14,548 6.9% 6.7% 7.0% f _____f__________
6 health/fitness 14,504 6.8% 6.7% 7.0% f _____f__________
7 sports 13,138 6.2% 6.0% 6.4% g ______g_________
8 leisure 12,023 5.7% 5.5% 5.8% h _______h________
9 tv/film 10,794 5.1% 5.0% 5.2% i ________i_______
10 news 10,755 5.1% 4.9% 5.2% i ________i_______
11 music 7,807 3.7% 3.6% 3.8% j _________j______
12 kids/family 4,843 2.3% 2.2% 2.4% k __________k_____
13 science 3,375 1.6% 1.5% 1.7% l ___________l____
14 technology 3,227 1.5% 1.4% 1.6% l ___________l____
15 fiction 2,762 1.3% 1.2% 1.4% m ____________m___
16 history 1,657 0.8% 0.7% 0.8% n _____________n__
17 crime 1,255 0.6% 0.5% 0.6% o ______________o_
18 government 938 0.4% 0.4% 0.5% p _______________p
Descriptive statistics of group (category) counts
count min max range mean median std mad skew
n_podcasts 19 938 27,536 26,598 11,145 10,794 8,251 7,419 0.44
percent 19 0.4% 13.0% 12.6% 5.3% 5.1% 3.9% 3.5% 0.44

Results. The chi-square goodness-of-fit test shows significant differences between the popularity of categories among podcast creators (p < 0.001). So post-hoc pair-wise comparisons were performed. Category “society/culture” 13.0% (95% CI 12.8%-13.2%) was the most popular followed by “religion/spirituality” 11.1% (95% CI 10.9%-11.3%), “business” 10.6% (95% CI 10.4%-10.8%), and “education” 9.8% (95% CI 9.6%-10.0%). Differences were significant between all the groups except “comedy” and “health/fitness”, “tv/film” and “news” as well as “science” and “technology”.



Question: Which categories are best rated?
Population: Podcast categories on the Apple Podcasts platform.

Code
res_rating_category = analyze_numeric_groups(
    "mean_rating",
    by="category",
    data=data_categories_reviews_ratings,
    ylabel="Mean rating",
    title="Rating of Categories",
)

plt.figtext(
    0.2,
    -0.5,
    "* The same podcast can have several categories",
    wrap=True,
    horizontalalignment="center",
    fontsize=8,
)

plt.xticks(rotation=90);
Omnibus (Kruskal-Wallis) test results
Source ddof1 H p-unc
Kruskal category 18 3039.94 p < 0.001
Post-hoc (Conover-Iman) test results as CLD and Confidence intervals (CI) of ratings
category cld spaced_cld mean ci_lower ci_upper
0 business a a__________ 4.89 4.88 4.90
1 religion/spirituality a a__________ 4.88 4.87 4.89
2 health/fitness bc _bc________ 4.86 4.85 4.87
3 education b _b_________ 4.84 4.83 4.85
4 comedy cd __cd_______ 4.84 4.83 4.85
5 society/culture cd __cd_______ 4.83 4.82 4.84
6 kids/family de ___de______ 4.83 4.81 4.85
7 music bc _bc________ 4.83 4.81 4.84
8 fiction efg ____efg____ 4.80 4.78 4.83
9 leisure ef ____ef_____ 4.80 4.79 4.81
10 arts ef ____ef_____ 4.79 4.78 4.80
11 sports f _____f_____ 4.78 4.77 4.79
12 tv/film h _______h___ 4.75 4.74 4.77
13 technology fg _____fg____ 4.75 4.72 4.77
14 science hi _______hi__ 4.71 4.68 4.74
15 history j _________j_ 4.69 4.66 4.72
16 government ghi ______ghi__ 4.67 4.63 4.72
17 news i ________i__ 4.66 4.64 4.67
18 crime k __________k 4.49 4.44 4.53
Descriptive statistics of group (category) means
count min max range mean median std mad skew
mean 19 4.49 4.89 0.41 4.77 4.80 0.10 0.05 -1.47

Results. “Business” (mean rating 4.89, CI 4.88–4.90) and “religion/spirituality” (4.88, CI 4.87–4.89) are the best-rated categories (there is no statistically significant difference between them). They differ significantly from the other categories, but the difference from the next-best categories is tiny: it appears only at the second decimal place of the rating value. The smallest average rating (4.49, CI 4.44–4.53) is in the “crime” category.


Question: Is there a relationship between the number of reviews and the average rating grouped by category?

Code
plot_scatter(
    res_counts_category,
    res_rating_category,
    by="category",
    corr="Spearman",
    counts_of="podcast",
)
plt.xlabel("Number of Podcasts per Category")
plt.title("Relationship Between Metrics \nAggregated by Category");
Spearman's correlation: r = 0.77, p < 0.001

Results: average category rating and number of podcasts per category are positively correlated (Spearman r = 0.77, p < 0.001).

4.4 Correlation between Number of Reviews, Rating and Other Features

Question: Does the rating of reviews correlate with other features (the number of categories/sub-categories per podcast and various podcast title, review title, and review body length metrics)?
Population: Reviews of podcasts on the Apple Podcasts platform.

Spearman correlation analysis is performed.

Code
# Correlation between rating and other features:
corr_result = do_corr_1_vs_other(data_for_rating_corr, var_x="rating")

# Correlation between number of reviews and other features:
corr_result = do_corr_1_vs_other(
    data_for_n_reviews_corr, var_x="n_reviews", corr_result=corr_result
)

# To data frame
corr_result = pd.DataFrame(corr_result)

# Sort by correlation strength
corr_result = corr_result.sort_values(by="r", key=abs, ascending=False)

# Control for Type 1 error: add column for p adjusted values
corr_result["p_adj"] = [
    my.format_p(i)
    for i in sms.multipletests(corr_result["p"], method="holm")[1]
]

Correlation between rating and other features:

Code
corr_result.query("variable_1=='rating'")
variable_1 variable_2 r p p_adj
13 rating review_body_n_words -0.09 0.00 p < 0.001
12 rating review_body_n_char -0.09 0.00 p < 0.001
15 rating review_body_title_n_char_ratio -0.09 0.00 p < 0.001
16 rating review_body_title_n_words_ratio -0.07 0.00 p < 0.001
11 rating review_title_avg_word_length 0.05 0.00 p < 0.001
8 rating n_subcategories 0.05 0.00 p < 0.001
17 rating review_body_title_avg_word_length_ratio -0.04 0.00 p < 0.001
4 rating podcast_title_n_char 0.03 0.00 p < 0.001
2 rating slug_n_char_wo_hyphen 0.03 0.00 p < 0.001
0 rating slug_n_char 0.03 0.00 p < 0.001
10 rating review_title_n_words -0.03 0.00 p < 0.001
5 rating podcast_title_n_words 0.03 0.00 p < 0.001
1 rating slug_n_hyphen 0.03 0.00 p < 0.001
14 rating review_body_avg_word_length 0.03 0.00 p < 0.001
6 rating podcast_title_avg_word_length 0.02 0.00 p < 0.001
9 rating review_title_n_char -0.01 0.00 p < 0.001
3 rating slug_avg_word_length 0.01 0.00 p < 0.001
7 rating n_categories -0.00 0.00 p < 0.001

Results: Although statistically significant, the relationships between rating and the other variables under investigation are weak (|r| < 0.10). The strongest are between:

  • rating and review body length (e.g., length in words, review_body_n_words), r = -0.09 (longer reviews tend to have slightly worse ratings),
  • rating and the average word length of the review title (review_title_avg_word_length), r = 0.05 (longer words in the review title are related to barely noticeably better ratings), as well as
  • rating and the number of sub-categories, r = 0.05 (more sub-categories are related to barely noticeably better ratings).


Question: Does the number of reviews correlate with other features (mean podcast rating, the number of categories/sub-categories per podcast, and various length metrics of the podcast title, review title, and review body)?
Population: Podcasts on the Apple Podcasts platform.

Correlation between number of reviews and other features:

Code
corr_result.query("variable_1=='n_reviews'")
variable_1 variable_2 r p p_adj
18 n_reviews mean_rating -0.40 0.00 p < 0.001
25 n_reviews mean_review_body_title_n_char_ratio 0.33 0.00 p < 0.001
23 n_reviews mean_review_body_n_words 0.32 0.00 p < 0.001
26 n_reviews mean_review_body_title_n_words_ratio 0.32 0.00 p < 0.001
22 n_reviews mean_review_body_n_char 0.32 0.00 p < 0.001
20 n_reviews mean_review_title_n_words 0.20 0.00 p < 0.001
28 n_reviews n_categories 0.17 0.00 p < 0.001
19 n_reviews mean_review_title_n_char 0.17 0.00 p < 0.001
29 n_reviews n_subcategories 0.14 0.00 p < 0.001
31 n_reviews slug_n_hyphen 0.06 0.00 p < 0.001
27 n_reviews mean_review_body_title_avg_word_leng... 0.06 0.00 p < 0.001
35 n_reviews podcast_title_n_words 0.06 0.00 p < 0.001
34 n_reviews podcast_title_n_char 0.05 0.00 p < 0.001
21 n_reviews mean_review_title_avg_word_length 0.05 0.00 p < 0.001
30 n_reviews slug_n_char 0.05 0.00 p < 0.001
32 n_reviews slug_n_char_wo_hyphen 0.05 0.00 p < 0.001
33 n_reviews slug_avg_word_length -0.03 0.00 p < 0.001
24 n_reviews mean_review_body_avg_word_length 0.02 0.00 p < 0.001
36 n_reviews podcast_title_avg_word_length -0.01 0.00 p < 0.001

As the correlation between n_reviews and mean_rating (r = -0.40) looks suspicious (it contradicts the results seen before), these variables are plotted for further investigation:

Code
# Data: rank-transform the variables (Spearman's correlation is the Pearson
# correlation of ranks, so plotting ranks visualizes it directly)
corr_vars = ["n_reviews", "mean_rating"]
ranks_all = data_for_n_reviews_corr[corr_vars].rank()
ranks_wo_rating_5 = (
    data_for_n_reviews_corr[corr_vars].query("mean_rating < 5").rank()
)

# Correlation
res = do_corr_1_vs_other(ranks_all, "n_reviews")
res = do_corr_1_vs_other(ranks_wo_rating_5, "n_reviews", res)
res = pd.DataFrame(res)
res.insert(2, "subset", ["All ratings", "Rating < 5"])
display(res)

# Plot
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 4))

# Subplot 1
sns.regplot(
    x="n_reviews",
    y="mean_rating",
    data=ranks_all,
    scatter_kws={"alpha": 0.01},
    line_kws={"color": "black"},
    ax=ax1,
)
ax1.set_xlabel("Rank of number of reviews")
ax1.set_ylabel("Rank of average rating")
ax1.set_title("All ratings")
my.ax_axis_comma_format(ax=ax1)

# Subplot 2
sns.regplot(
    x="n_reviews",
    y="mean_rating",
    data=ranks_wo_rating_5,
    scatter_kws={"alpha": 0.05},
    line_kws={"color": "black"},
    ax=ax2,
)

ax2.set_xlabel("Rank of number of reviews")
ax2.set_ylabel("Rank of average rating")
ax2.set_title("Ratings < 5")
my.ax_axis_comma_format(ax=ax2)

# Common
plt.suptitle("Relationship between Ranks of Mean Rating and Reviews Amount")
plt.tight_layout();
variable_1 variable_2 subset r p
0 n_reviews mean_rating All ratings -0.40 0.00
1 n_reviews mean_rating Rating < 5 0.60 0.00

Results: In general, the correlations between the number of reviews and the investigated variables are stronger than those between the rating and the investigated variables.

The largest correlation was detected between the number of reviews per podcast and the mean podcast rating, but the graphical investigation showed that if the top ratings are removed, the sign of the correlation coefficient (and the slope of the regression line) changes from negative (Spearman’s r = -0.40, p_adj < 0.001) to positive (r = 0.60, p < 0.001; p_adj was not calculated). This means that the high proportion of tied top ratings distorted the results: the trends with and without these values contradict each other.

A graphical investigation of a few more variables with the highest correlations did not reveal this kind of discrepancy (these plots are not included in the report).
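The distorting effect of a large block of tied top ratings can be reproduced on synthetic data (an illustration only, not the project's data):

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)

# Podcasts with mean_rating < 5: rating grows with the number of reviews
n = 1_000
n_reviews = rng.integers(1, 500, size=n)
mean_rating = np.clip(3 + 2 * n_reviews / 500 + rng.normal(0, 0.3, size=n), 1, 4.99)

# A large block of podcasts tied at the top rating, each with few reviews
n_top = 3_000
x = np.concatenate([n_reviews, rng.integers(1, 30, size=n_top)])
y = np.concatenate([mean_rating, np.full(n_top, 5.0)])

r_all, _ = st.spearmanr(x, y)                      # negative overall
r_sub, _ = st.spearmanr(n_reviews, mean_rating)    # positive for ratings < 5
print(f"All ratings: r = {r_all:.2f}; ratings < 5: r = {r_sub:.2f}")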

The next biggest correlations were detected between:

  • the number of reviews and various review and review title length metrics (e.g., mean_review_body_title_n_char_ratio r = 0.33, mean_review_body_n_words r = 0.32, mean_review_title_n_words r = 0.20), which essentially show that longer reviews and review titles, as well as a larger ratio of review length to review title length, are related to a larger number of reviews;
  • the number of reviews and the number of categories (r = 0.17).

5 Looker Studio Dashboard

A part of this project was to create a dashboard with at least 3 charts in Looker Studio. The resulting tool allows interactive exploration of temporal trends and patterns in ratings and the number of reviews in various subsets of the data, and it is accessible here. The dashboard consists of 2 pages:

  1. The first page contains plots whose rating Y axis ranges from 1 to 5. This version of the dashboard is more flexible for interactive analysis, as a wider range of possible mean rating values can be visualized.
  2. The second page contains the same plots with a zoomed-in rating axis. This allows seeing (possible) patterns in ratings where the differences between values are small, but this version of the dashboard is less convenient for interactive exploration.

A preview of the dashboards can be seen in the screenshots below:

  • blue columns represent the number of reviews and
  • red connected dots indicate the average rating.

Due to Looker Studio’s 100 MB per-dataset limit, only a subset of the data (~92% of reviews from the data_all dataset) is used in the dashboard.
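For illustration, such a subset could be produced and exported along these lines; this is a sketch only, assuming data_all is a pandas DataFrame of reviews, with an illustrative sampling fraction and file name (the actual selection method for the dashboard may differ):

# Randomly sample ~92% of reviews so the exported file fits under the limit
subset = data_all.sample(frac=0.92, random_state=1)
subset.to_csv("reviews_for_looker_studio.csv", index=False)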

The dashboard in which the Y axis for ratings (on the right-hand side) ranges from 1 to 5.


The dashboard with a zoomed-in rating axis.

6 Main Findings

  1. Sample size:
    • More than 100 thousand podcasts and more than 2 million reviews are included in the analysis.
  2. Ratings (overall):
    • The distribution of ratings is asymmetric: the top rating (5 stars) is the most common (given in 87% of reviews);
    • The median rating is also 5 stars;
    • The mean overall rating is 4.63.
  3. The analysis of categories:
    • There are 19 main categories of podcasts (after merging similar sub-categories) and 110 sub-categories;
    • Each podcast has up to 5 main categories;
    • Single-category podcasts are the most common (73% of all podcasts) but, compared to podcasts with 2 or 3 categories, are rated slightly lower (4.80 vs. 4.83 and 4.84).
    • The average category rating and the number of podcasts per category are positively correlated (Spearman r = 0.77, p < 0.001): categories with more podcasts tend to have better ratings (and vice versa).
    • The most common categories among podcast creators are:
      • “society/culture” (13% of podcasts),
      • “religion/spirituality” (11%),
      • “business” (11%), and
      • “education” (10%).
    • The least common categories are:
      • “history”, “crime” and “government” (< 1% each).
    • The best-rated categories are “business” (rating is 4.89) and “religion/spirituality” (4.88).
    • The next best-rated categories differ from the best ones only by the second decimal digit (the difference is negligible although it is statistically significant).
    • The worst-rated category is “crime” (4.49).
  4. Temporal patterns and trends:
    • Data covers the period from December 2005 to December 2022 (17 years).

    • Differences between the minimum and maximum values of ratings and of the amount of reviews, as well as the correlation between these two metrics (ns – not significant):

      | Comparing | Difference in Ratings | Difference in Amount of Reviews | Spearman’s Correlation |
      |---|---|---|---|
      | Years (2006-2021) | 0.28 | 23.3% | ns |
      | Months of year | 0.05 | 1.5% | ns |
      | Days of week | 0.04 | 6.8% | r = 0.79, p = 0.036 |
      | Hours of day (in UTC-07:00) | 0.22 | 5.4% | r = 0.69, p < 0.001 |
    • Year-to-year trends:

      • Clear trends in both the number of reviews and the ratings are visible:
        • the yearly number of reviews increased until 2020 and has been dropping since;
        • the average yearly rating increased until 2014-2015 and has been dropping since.
    • Month-to-month patterns:

      • Overall, the largest share of monthly reviews (9.2%) is created in October and the smallest (7.7%) in December.
      • The patterns in ratings are statistically significant yet negligible in size.
    • Weekly patterns (by the day of the week):

      • The largest number of reviews per day is created in the middle of the week (e.g., 17.0% of weekly reviews on Wednesdays) and the smallest on weekends (10.2% on each weekend day).
      • Despite the small range in ratings, a statistically significant relationship between the number of reviews and the average rating grouped by weekday is detected (see the table above).
    • Daily patterns (by hour of day):

      • Patterns in the number of reviews and ratings are very clear: during the daytime of the indicated time zone (UTC-07:00), more reviews per hour are created and better average ratings are given, while at night (especially around 1 and 2 AM) fewer reviews are created and the ratings are worse.
      • The relationship between the two metrics is statistically significant (see the table above).
  5. Correlation analysis:
    • Between reviews’ ratings and the other investigated characteristics (review, review title, and podcast title length metrics as well as the number of categories and sub-categories), statistically significant but extremely weak correlations (Spearman’s |r| < 0.10) are detected.
    • The correlation between the number of reviews per podcast and the mean podcast rating gave contradictory results, so it is not commented on further.
    • The correlations between the number of reviews per podcast and review length metrics are significant and positive (up to r = 0.33), showing that podcasts with more reviews tend to have longer reviews.
    • The correlation between the number of reviews and the number of categories per podcast is significant and positive (r = 0.17).

7 Limitations and Suggestions for Improvement

Some ideas on how the analysis may be improved and what else can be analyzed:

  • Confidence intervals and statistical hypothesis testing results could be displayed on a plot.
  • In the analysis of temporal patterns and trends, effects at coarser granularity may influence patterns at finer granularity (e.g., weekly patterns may differ from year to year). This was not investigated.
  • A more detailed analysis of separate podcasts and their characteristics was not performed.
  • A more detailed analysis of separate users/reviewers was not performed.
  • A more detailed analysis of the review text, review title, and podcast title may be performed, e.g.:
    • Language identification.
    • Analysis of emojis.
    • Sentiment analysis.
  • The time duration between the newest and oldest review of each podcast may be analyzed.
  • Dashboard:
    • A more detailed description of the dashboard and used variables could be provided.
    • Filters for categories are not included in the dashboard.
    • Only a fraction of the data analyzed in Python was included in the dashboard (due to technical reasons).