the analysis must include inferential statistics procedures.
create a Looker Studio dashboard based on the analysis.
Abbreviations
CI – 95% confidence interval.
CLD – compact letter display.
gof, GOF – goodness of fit.
n – either sample size or group size.
ns – not significant.
p – p-value.
p_adj – p-value (adjusted).
r – correlation coefficient (Spearman’s).
UTC – coordinated universal time.
wo – without.
π – pi (proportion).
χ² – chi-squared.
1 Introduction
A podcast is a series of spoken-word episodes, all focused on a particular topic or theme, like cycling or startups (source: The Podcast Host). It is a program made available in digital format either for download over the Internet (source: Wikipedia) or for watching/listening online. Apple Podcasts is one of the prominent platforms dedicated to hosting podcasts. This project is focused on the analysis of podcasts (their categories, reviews, and ratings) hosted on this platform.
1.1 Purpose
This analysis aims to investigate:
the number of reviews and the ratings of podcasts, including:
their temporal trends and patterns, and
relationships between these two variables and other features (various podcast title, review title, and review body length metrics, and the number of categories/sub-categories per podcast);
the popularity of podcast categories.
2 Methods
For statistical inference, significance level \(\alpha=0.05\) and 95% confidence level are used.
2.1 Population and Sample
It is assumed that the data is a simple random sample or its equivalent that allows using inferential statistics methods. In most cases, where various features of reviews are analyzed, the population can be defined as:
reviews of podcasts on the Apple Podcasts platform.
In cases, where metrics aggregated for each podcast are used, the population can be defined as:
podcasts on the Apple Podcasts platform.
Lastly, in the analysis of categories, as the same podcast can have several categories, each category entry is treated as a separate independent observational unit (and category labels as well as podcast information are treated as features of each entry). In this case, the population can be defined as:
podcast categories on the Apple Podcasts platform.
2.2 Differences Between Groups
For statistical inference about several groups (categories), the following strategy is used:
first, 95% confidence intervals (CI) for each group are calculated,
second, an omnibus test is performed,
third, when the result of the omnibus statistical test is significant, a post-hoc analysis (pair-wise comparisons) is performed.
In post-hoc analysis, \(p\) values are adjusted with the Holm correction to control the inflated probability of a type I error. Moreover, the results of pair-wise comparisons are compactly presented as a CLD¹ (compact letter display) representation of the adjusted \(p\) values.
¹ IMPORTANT: In the CLD representation of the results, a letter shared by several groups (e.g., control and treatment) indicates that the difference is statistically insignificant (\(p \ge \alpha\)), and the lack of a shared letter shows a statistically significant difference. The main advantage of CLD is that instead of \(\frac{k(k-1)}{2}\) \(p\) values from each-to-each pair-wise comparisons we get \(k\) CLD representations of the same results (here \(k\) is the number of groups), i.e., the results are easier to understand if many groups are compared. For example, for \(k = 19\) groups, 171 pair-wise \(p\) values collapse into 19 letter codes.
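As an illustration of how the Holm correction inflates raw \(p\) values, here is a minimal sketch using statsmodels (the \(p\) values are made up; the project itself applies the correction inside its helper functions):

# A minimal sketch of Holm adjustment (toy p values, for illustration only)
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.020, 0.040, 0.300]  # raw pair-wise p values
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(p_adj)  # [0.004, 0.06, 0.08, 0.3]: each p value is scaled and made monotone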
In the analysis of differences between group/category counts (e.g., the analysis of the number of reviews per certain time period), the following methods are used:
plot: bar chart with absolute counts on the y-axis and percentage values as bar labels;
confidence intervals: Goodman’s simultaneous confidence intervals of proportions;
omnibus: chi-squared (χ²) goodness-of-fit (GOF) test,
hypotheses: \(H_0:\) all proportions are equal: \(\pi_1 = \pi_2 = ... = \pi_i = ... = \pi_k\); \(H_1:\) at least two proportions differ: \(\pi_i \ne \pi_j\) for at least one pair of \(i\) and \(j\);
post-hoc: pair-wise chi-squared (χ²) GOF tests (with Holm correction),
hypotheses: \(H_0:\) both proportions are equal: \(\pi_i = \pi_j\); \(H_1:\) proportions differ: \(\pi_i \ne \pi_j\).
Here \(\pi\) (pi) is a proportion (relative frequency, percentage) of values in a certain group, \(i\) and \(j\) are group indices (\(i = 1, 2, …, k\); \(j = 1, 2, …, k\); \(i \ne j\)), and \(k\) is the total number of groups.
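The following minimal, self-contained sketch illustrates this workflow on made-up counts using scipy and statsmodels (`multinomial_proportions_confint` with `method="goodman"`); the project itself wraps the same steps in helpers from functions.py:

# A minimal sketch of the count-data workflow (toy counts, k = 3 groups)
from itertools import combinations

import numpy as np
import scipy.stats as sps
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.proportion import multinomial_proportions_confint

counts = np.array([150, 90, 60])  # observations per group

# Goodman's simultaneous 95% confidence intervals for the k proportions
ci = multinomial_proportions_confint(counts, alpha=0.05, method="goodman")

# Omnibus chi-squared GOF test (H0: all proportions are equal)
omnibus = sps.chisquare(counts)

# Post-hoc: pair-wise chi-squared GOF tests with Holm correction
pairs = list(combinations(range(len(counts)), 2))
p_raw = [sps.chisquare(counts[[i, j]]).pvalue for i, j in pairs]
p_adj = multipletests(p_raw, method="holm")[1]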
In the analysis of differences between continuous/numeric data groups (e.g., the analysis of ratings grouped by time periods), the following methods are used:
plot: dot/scatter plot where dots represent means and error bars indicate 95% CI. In most cases the sample size is very large and the standard errors are small, so the error bars may not be visible;
confidence intervals: t-distribution-based confidence intervals for the mean;
omnibus: non-parametric Kruskal-Wallis test,
hypotheses: \(H_0:\) no stochastic dominance among the groups; \(H_1:\) at least one group stochastically dominates one other group;
post-hoc: non-parametric Conover-Iman test (with Holm correction),
hypotheses: \(H_0:\) no stochastic dominance between the two groups; \(H_1:\) one group stochastically dominates the other group.
The term “stochastic dominance” means that one of the groups tends to have (statistically) larger values than some other group.
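A minimal sketch of these two tests on toy data, using the same pingouin and scikit-posthocs functions that the project’s `analyze_numeric_groups()` helper (section 3.1) calls internally:

# A minimal sketch of the numeric-groups workflow (toy data)
import pandas as pd
import pingouin as pg
import scikit_posthocs as sp

df = pd.DataFrame({
    "rating": [5, 4, 5, 3, 5, 4, 2, 5, 4, 5, 1, 3],
    "group":  list("AAAABBBBCCCC"),
})

# Omnibus: Kruskal-Wallis test
print(pg.kruskal(data=df, dv="rating", between="group"))

# Post-hoc: Conover-Iman test with Holm correction (matrix of adjusted p values)
print(sp.posthoc_conover(df, val_col="rating", group_col="group", p_adjust="holm"))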
2.3 Correlation Between Variables
For the relationship between numeric variables, non-parametric Spearman’s rank correlation analysis is used.
Hypotheses: \(H_0: \rho_s = 0\) (variables do not correlate); \(H_1: \rho_s \ne 0\) (variables correlate).
Here \(\rho_s\) (rho) is the population Spearman’s correlation coefficient.
In the temporal patterns and trends analysis, no p-value correction is applied. For the correlation analyses, where the number-of-reviews and rating variables are used several times, the Holm correction is applied to prevent an inflated type I error rate.
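For reference, a minimal sketch of the correlation test on toy data (`scipy.stats.spearmanr`, the same function used inside the project’s helpers):

# A minimal sketch of Spearman's rank correlation (toy data)
import scipy.stats as sps

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r, p = sps.spearmanr(x, y)
print(f"Spearman r = {r:.2f}, p = {p:.3f}")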
3 Preparation and Inspection
In this section, statistical software is prepared for the analysis, and data are imported into the statistical software, inspected, and pre-processed to be ready for a deeper statistical analysis.
3.1 Setup
Code
# Automatically reload certain modules
%reload_ext autoreload
%autoreload 1
%aimport functions

# Packages and modules -------------------------------
import os
import warnings

# To connect SQL database
import sqlalchemy as sql

# Data wrangling, math
import numpy as np
import pandas as pd

# Statistical analysis
import scipy.stats as sps
import statsmodels.stats.api as sms
import pingouin as pg
import scikit_posthocs as sp

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Custom functions
import functions as my

# Enable ability to run R in Python code cells
os.environ["R_HOME"] = "C:/PROGRA~1/R/R-4.3.1"
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    import rpy2
%load_ext rpy2.ipython

# Settings --------------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 40)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
pd.set_option("styler.format.thousands", ",")

# Colors
green, blue, orange, red = "tab:green", "tab:blue", "tab:orange", "tab:red"
In this section, ad-hoc (specialized for this analysis) functions are defined. The more generalizable functions can be found in functions.py and functions.R files of this project.
Code
# Main analysis functions
def analyze_counts(
    counts,
    by: str,
    xlabel=None,
    ylabel=None,
    counts_of: str = "review",
    **kwargs,
):
    """The main function to analyze count data

    - Performs omnibus chi-squared and post-hoc pair-wise chi-squared tests.
    - Compactly presents results of the post-hoc test as CLD (compact letter
      display, where a shared CLD letter shows no significant difference
      between groups). NOTE: CLD computations require R.
    - Calculates percentages and their confidence intervals by using
      Goodman's method.
    - Creates a summary of grouped values (group counts and percentages).
    - Plots results as bar plots with percentage labels.

    Args:
        counts (pandas.Series[int]): Count data to analyze.
        by (str): Grouping variable name. Used to create labels.
        xlabel (str, None, optional): X axis label.
            Defaults to None: autogenerated label.
        ylabel (str, None, optional): Y axis label.
            Defaults to None: autogenerated label.
        counts_of (str, optional): The thing that was counted. Used for
            labels. Defaults to "review".
        **kwargs: Further arguments passed to `my.plot_counts_with_labels()`.
    """
    # Omnibus test
    my.display_collapsible(
        my.test_chi_square_gof(counts), "Omnibus (chi-squared) test results"
    )

    # Post-hoc pairwise chi-squared
    posthoc_p = my.pairwise_chisq_gof_test(counts)
    posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p, output_gr_var=by)

    # Confidence interval
    n_label = f"n_{counts_of}s"
    ci = (
        my.ci_proportion_multinomial(counts, method="goodman", n_label=n_label)
        .rename_axis(by)
        .reset_index()
    )

    # Make sure datasets are mergeable
    ci[by] = ci[by].astype(str)
    posthoc_cld[by] = posthoc_cld[by].astype(str)

    # Merge results
    n_per_category = pd.merge(ci, posthoc_cld, on=by)

    # Descriptive statistics: calculate
    to_format = ["min", "max", "range", "mean", "median", "std", "mad"]
    format_0f = lambda x: [f"{i:,.0f}" for i in x]
    summary_count = my.calc_summaries(ci[n_label])
    summary_count[to_format] = summary_count[to_format].apply(format_0f)
    summary_perc = my.calc_summaries(ci["percent"])
    summary_perc[to_format] = summary_perc[to_format].apply(my.format_percent)

    # Format percentages and counts
    vars = ["percent", "ci_lower", "ci_upper"]
    n_per_category[vars] = n_per_category[vars].apply(my.format_percent)
    my.display_collapsible(
        n_per_category.style.format({n_label: "{:,.0f}"}),
        f"{counts_of.capitalize()} counts with 95% CI and post-hoc "
        "(pairwise chi-squared) test results",
    )

    # Descriptive statistics: display
    my.display_collapsible(
        pd.concat([summary_count, summary_perc]),
        f"Descriptive statistics of group ({by}) counts",
    )

    # Plot
    if xlabel is None:
        xlabel = by.capitalize()
    if ylabel is None:
        ylabel = f"Number of {counts_of}s"
    my.plot_counts_with_labels(
        n_per_category, x_lab=xlabel, y_lab=ylabel, y=n_label, **kwargs
    )
    my.ax_axis_comma_format("y")

    # Output
    return n_per_category


def analyze_numeric_groups(
    y: str, by: str, data, title=None, xlabel=None, ylabel=None
):
    """The main function to analyze numeric/continuous data by groups

    - Calculates mean ratings per group and their confidence intervals
      using the t distribution.
    - Performs omnibus (Kruskal-Wallis) and post-hoc (Conover-Iman) tests.
    - Compactly presents results of the post-hoc test as CLD (compact letter
      display, where a shared CLD letter shows no significant difference
      between groups). NOTE: CLD computations require R.
    - Creates a summary of grouped values (group counts and percentages).
    - Plots results as points with 95% confidence interval error bars.

    Args:
        y (str): Name of numeric/continuous (dependent) variable.
        by (str): Name of grouping (independent) variable.
        data (pandas.DataFrame): Data frame with variables indicated in
            `y` and `by`.
        title (str, None, optional): The title of the plot. Defaults to None.
        xlabel (str, None, optional): X axis label.
            Defaults to None: capitalized value of `by`.
        ylabel (str, None, optional): Y axis label.
            Defaults to None: capitalized value of `y`.
    """
    # Omnibus test: Kruskal-Wallis test
    stat_rez = pg.kruskal(data=data, dv=y, between=by)
    stat_rez["p-unc"] = my.format_p(stat_rez["p-unc"][0])
    my.display_collapsible(stat_rez, "Omnibus (Kruskal-Wallis) test results")

    # Confidence intervals
    ci_raw = data.groupby(by)[y].apply(
        lambda x: [np.mean(x), *sms.DescrStatsW(x).tconfint_mean()]
    )
    ci = pd.DataFrame(
        list(ci_raw),
        index=ci_raw.index,
        columns=["mean", "ci_lower", "ci_upper"],
    ).reset_index()

    # Post-hoc test: Conover-Iman test
    posthoc_p_matrix = sp.posthoc_conover(
        data, val_col=y, group_col=by, p_adjust="holm"
    )
    posthoc_p_df = my.as_df(
        posthoc_p_matrix.stack(), ["group1", "group2"], "p.adj"
    )
    posthoc_cld = my.convert_pairwise_p_to_cld(posthoc_p_df, output_gr_var=by)

    # Make sure datasets are mergeable
    ci[by] = ci[by].astype(str)
    posthoc_cld[by] = posthoc_cld[by].astype(str)
    ci_and_cld = pd.merge(posthoc_cld, ci, on=by)

    # Display CI and CLD
    my.display_collapsible(
        ci_and_cld,
        "Post-hoc (Conover-Iman) test results as CLD and "
        "Confidence intervals (CI) of ratings",
    )

    # Descriptive statistics of means
    my.display_collapsible(
        my.calc_summaries(ci["mean"]),
        f"Descriptive statistics of group ({by}) means",
    )

    # Plot
    fig, ax = plt.subplots()
    x = ci.iloc[:, 0]
    ax.errorbar(
        x=x,
        y=ci["mean"],
        yerr=[ci["mean"] - ci["ci_lower"], ci["ci_upper"] - ci["mean"]],
        mfc="red",
        ms=2,
        mew=1,
        fmt="ko",
        zorder=3,
    )
    ax.axhline(
        y=ci["mean"].mean(),
        color="lightgray",
        linestyle="--",
        label="Reference line (mean of means)",
        zorder=1,
    )
    if xlabel is None:
        xlabel = by.capitalize()
    if ylabel is None:
        ylabel = y.capitalize()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_ylim([4, 5])
    ax.set_title(title)
    ax.legend(frameon=False, loc="lower right")

    # Output
    return ci_and_cld


def analyze_ratings(
    by, title=None, xlabel=None, ylabel=None, y="rating", data=None
):
    """The main function to analyze ratings

    See the description of function `analyze_numeric_groups()` for more
    details. The environment in which the function is called must have
    a variable (data frame) `data_reviews`.
    """
    if data is None:
        data = data_reviews
    return analyze_numeric_groups(
        y=y, by=by, data=data, title=title, xlabel=xlabel, ylabel=ylabel
    )


def plot_scatter(data_x, data_y, by, corr=None, counts_of: str = "review"):
    """Plot relationship between mean rating and number of reviews
    aggregated by group

    This is a convenience function to merge and plot the results of
    functions `analyze_counts()` and `analyze_ratings()`.

    Args:
        data_x (pandas.DataFrame): Data frame with required data.
        data_y (pandas.DataFrame): Data frame with required data.
        by (str): Variable used for aggregation.
        corr (str, None, optional): If "Spearman", the results of Spearman's
            correlation analysis are printed.
        counts_of (str, optional): The thing that was counted. Used for
            creating variable names and labels. Defaults to "review".
    """
    assert corr in [
        "Spearman",
        None,
    ], f"Unknown value for argument `corr`: {corr}"

    x_var = f"n_{counts_of}s"
    data_merged = pd.merge(data_x, data_y, on=by)
    data_merged.plot.scatter(
        x=x_var,
        y="mean",
        figsize=(3.5, 3),
        rot=45,
        xlabel=f"Number of {counts_of}s",
        ylabel="Average rating",
        title=f"Relationship Between Metrics \n"
        f"Aggregated by {by.capitalize()}",
    )
    my.ax_axis_comma_format("x")

    r, p = sps.spearmanr(data_merged[x_var], data_merged["mean"])
    if corr == "Spearman":
        print(f"Spearman's correlation: r = {r:.2f}, {my.format_p(p)}")


def do_corr_1_vs_other(data, var_x, corr_result=None):
    """Perform Spearman correlation analysis in 1-vs-remaining style

    Perform Spearman correlation between the selected and all the remaining
    numeric variables in the dataset.

    Args:
        data (pandas.DataFrame): Data frame with variables to analyze.
        var_x (str): Variable used as the first variable in the correlation
            analysis.
        corr_result (list or None): List to collect the results of the
            analysis. Defaults to None.

    Returns:
        list: `corr_result` with appended results
    """
    # Initiate list, if not provided
    if corr_result is None:
        corr_result = []

    # The second column (var_y) must be numeric
    columns_to_include = (
        data.select_dtypes("number").drop(var_x, axis=1).columns
    )

    # Do analysis and collect the results
    for var_y in columns_to_include:
        if var_x == var_y:
            continue
        r, p = sps.spearmanr(data[var_x], data[var_y], nan_policy="omit")
        corr_result.append(
            {"variable_1": var_x, "variable_2": var_y, "r": r, "p": p}
        )

    # Return results
    return corr_result
3.2 Database
The database is downloaded from Kaggle to the directory data and renamed to podcast-reviews-database-v27.sqlite. At the moment of the analysis, the newest version of the database was v27. It consists of 4 tables:
runs: facts on database updates.
podcasts: podcast names, URLs, and similar data.
categories: data on categories of podcasts.
reviews: reviews and ratings of podcasts.
Code
# Create a connection to the database
db_engine = sql.create_engine("sqlite:///data/podcast-reviews-database-v27.sqlite")
Code
# List tables
sql.inspect(db_engine).get_table_names()
['categories', 'podcasts', 'reviews', 'runs']
Entity relationship diagram (ERD) of this database:
As metadata is poor in this database, there are no connections drawn between the tables. The following sections will inspect, explore and analyze the data in more detail.
3.3 Data Import and Inspection
This section deals with individual (not merged) data frames.
3.3.1 Import and Inspection: “runs”
The runs dataset contains information on when the database was last updated.
Code
data_runs = pd.read_sql("SELECT * FROM runs;", db_engine)
print(f'The database was last updated on: {data_runs["run_at"].max()}')
The database was last updated on: 2022-12-10 01:17:07
3.3.2 Import and Inspection: “podcasts”
The podcasts table contains data on 109,602 podcasts. There are 745 podcasts with a non-unique name. No other discrepancies were detected.
Code
data_podcasts = pd.read_sql("SELECT * FROM podcasts;", db_engine)
3.3.3 Import and Inspection: “categories”
The categories dataset contains data on podcast categories. There are 211,760 entries for categories (one row per podcast-category combination) and 109,602 unique podcasts (the same number as in the previous dataset).
Before importing into Python, some similar categories were merged to create a new variable category. The old variable with categories was renamed to sub_category (i.e., these categories are now treated as sub-categories). As a result, there are 19 categories and 110 sub-categories in the dataset.
Code
data_categories = pd.read_sql_query(
    """--sql
    SELECT
        c.podcast_id,
        CASE
            WHEN c.category LIKE 'arts%' THEN 'arts'
            WHEN c.category LIKE 'business%' THEN 'business'
            WHEN c.category LIKE 'comedy%' THEN 'comedy'
            WHEN c.category LIKE 'true-crime%' THEN 'crime'
            WHEN c.category LIKE 'education%' THEN 'education'
            WHEN c.category LIKE 'fiction%' THEN 'fiction'
            WHEN c.category LIKE 'tv-film%' THEN 'tv/film'
            WHEN c.category LIKE 'government%' THEN 'government'
            WHEN c.category LIKE 'health-fitness%' THEN 'health/fitness'
            WHEN c.category LIKE 'history%' THEN 'history'
            WHEN c.category LIKE 'kids-family%' THEN 'kids/family'
            WHEN c.category LIKE 'leisure%' THEN 'leisure'
            WHEN c.category LIKE 'music%' THEN 'music'
            WHEN c.category LIKE 'news%' THEN 'news'
            WHEN c.category LIKE 'science%' THEN 'science'
            WHEN c.category LIKE 'society-culture%' THEN 'society/culture'
            WHEN c.category LIKE 'sports%' THEN 'sports'
            WHEN c.category LIKE 'technology%' THEN 'technology'
            WHEN c.category LIKE 'religion%'
                OR c.category LIKE 'spirituality%'
                OR c.category LIKE 'buddhism%'
                OR c.category LIKE 'christianity%'
                OR c.category LIKE 'hinduism%'
                OR c.category LIKE 'islam%'
                OR c.category LIKE 'judaism%' THEN 'religion/spirituality'
        END AS category,
        c.category AS sub_category
    FROM categories AS c;
    """,
    db_engine,
)
3.3.4 Import and Inspection: “reviews”
There are 2,025,333 entries in the dataset, and 652 of them are identified as duplicates. The number of unique review authors (n = 1,451,063) is smaller than the number of reviews (the ratio of unique authors to reviews is ~0.7), which means that, on average, each author creates about 1.4 reviews, i.e., many authors create more than one review. The number of unique podcasts is 111,125, which is higher than in the previous datasets. This is investigated in more detail in the section on several data frames.
All timestamps of reviews are in the UTC-07:00 timezone. Some dates were in an unusual format/script (e.g., ٢٠٢٠-١٢-٢٤T١١:٥٥:٠٨-07:00), but Python functions managed to automatically standardize even these dates.
No other discrepancies were found.
Code
data_reviews = pd.read_sql("SELECT * FROM reviews;", db_engine)
3.4 Data Pre-Processing
This section deals with pre-processing of individual (not merged) data frames. Data frame “runs” is not included in the further analysis, so there is no dedicated sub-subsection for it.
3.4.1 Pre-Processing: “podcasts”
Slug and podcast title length metrics were extracted as additional features:
For slug:
number of characters,
number of hyphens,
number of characters without hyphens,
average word length.
For podcast title:
number of characters,
number of words,
average word length.
Code
# Here a word is a substring that contains digits, letters, underscores, and
# hyphens (when hyphens are in the middle of the word)
word_pattern = r"\w+((-\w)*\w)*"

data_podcasts = data_podcasts.rename({"title": "podcast_title"}, axis=1)

# Info on slug
data_podcasts["slug_n_char"] = data_podcasts["slug"].apply(len)
data_podcasts["slug_n_hyphen"] = data_podcasts["slug"].str.count("-")
data_podcasts["slug_n_char_wo_hyphen"] = data_podcasts.eval(
    "slug_n_char - slug_n_hyphen"
)
data_podcasts["slug_avg_word_length"] = data_podcasts.eval(
    "slug_n_char_wo_hyphen / (slug_n_hyphen + 1)"
)

# Info on podcast title
data_podcasts["podcast_title_n_char"] = data_podcasts["podcast_title"].apply(len)
data_podcasts["podcast_title_n_words"] = data_podcasts["podcast_title"].str.count(
    word_pattern
)
data_podcasts["podcast_title_avg_word_length"] = data_podcasts.eval(
    "podcast_title_n_char / podcast_title_n_words"
)
3.4.2 Pre-Processing: “categories”
The step of merging categories was performed before importing this dataset into Python (see the “categories” data import and inspection section).
Now one additional dataset with category counts will be created: it will contain a single row per podcast.
Code
n_categories_per_podcast = pd.concat(
    [
        # n categories per podcast
        data_categories.drop_duplicates(subset=["podcast_id", "category"])[
            "podcast_id"
        ]
        .value_counts()
        .rename("n_categories"),
        # n sub-categories per podcast
        data_categories["podcast_id"].value_counts().rename("n_subcategories"),
    ],
    axis=1,
).rename_axis("podcast_id")

# Preview
n_categories_per_podcast.head(2).reset_index()
   podcast_id                        n_categories  n_subcategories
0  c7c3acaf63bd2e97631c2701c135ae38  5             7
1  c5c2b1d7f6f0fc30f0e120a42ac7c931  5             5
3.4.3 Pre-Processing: “reviews”
The main steps in “reviews” pre-processing:
Duplicated entries are removed.
Timestamps are converted from string to Python date-time format.
Year, month, weekday name, and hour of the day when a review was created are extracted as separate variables to investigate temporal trends and patterns.
Review title and review body length metrics (number of characters, number of words, and average word length), as well as ratios of review body to review title length metrics, are calculated.
Code
# Remove duplicates
data_reviews = data_reviews.drop_duplicates()

# Convert to datetime format
data_reviews["created_at"] = pd.to_datetime(
    data_reviews["created_at"], format="mixed"
)

# Extract information from dates
data_reviews["year"] = data_reviews["created_at"].dt.year
data_reviews["month"] = data_reviews["created_at"].dt.month
data_reviews["weekday"] = pd.Categorical(
    data_reviews["created_at"].dt.day_name(),
    categories=[
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday",
    ],
    ordered=True,
)
data_reviews["hour"] = data_reviews["created_at"].dt.hour
Code
# To remove ambiguity in merged dataset
data_reviews = data_reviews.rename({"title": "review_title"}, axis=1)
Code
# Metrics of review title length
data_reviews["review_title_n_char"] = data_reviews["review_title"].apply(len)
data_reviews["review_title_n_words"] = data_reviews["review_title"].str.count(
    word_pattern
)
data_reviews["review_title_avg_word_length"] = data_reviews.eval(
    "review_title_n_char / review_title_n_words"
).replace([np.inf, np.nan], 0)

# Metrics of review body length
data_reviews["review_body_n_char"] = data_reviews["content"].apply(len)
data_reviews["review_body_n_words"] = data_reviews["content"].str.count(
    word_pattern
)
data_reviews["review_body_avg_word_length"] = data_reviews.eval(
    "review_body_n_char / review_body_n_words"
).replace([np.inf, np.nan], 0)
Code
# Check if duplicates are removed
print(f"n duplicates = {sum(data_reviews.duplicated())}")
n duplicates = 0
Code
my.count_unique(data_reviews)
variable                                  n_unique  percent_unique
podcast_id                                111125    5.5%
review_title                              1117416   55.2%
content                                   2008001   99.2%
rating                                    5         <0.1%
author_id                                 1451063   71.7%
created_at                                2012360   99.4%
year                                      18        <0.1%
month                                     12        <0.1%
weekday                                   7         <0.1%
hour                                      24        <0.1%
review_title_n_char                       101       <0.1%
review_title_n_words                      27        <0.1%
review_title_avg_word_length              767       <0.1%
review_body_n_char                        3595      0.2%
review_body_n_words                       888       <0.1%
review_body_avg_word_length               37487     1.9%
review_body_title_n_char_ratio            53313     2.6%
review_body_title_n_words_ratio           4725      0.2%
review_body_title_avg_word_length_ratio   297529    14.7%
Code
# Preview
data_reviews.head(2)
(Preview transposed for readability: one variable per line, values of rows 0 and 1.)
variable                                  row 0                                     row 1
podcast_id                                c61aa81c9b929a66f0c1db6cbe5d8548          c61aa81c9b929a66f0c1db6cbe5d8548
review_title                              really interesting!                       Must listen for anyone interested in...
content                                   Thanks for providing these insights....   Super excited to see this podcast gr...
rating                                    5                                         5
author_id                                 F7E5A318989779D                           F6BF5472689BD12
created_at                                2018-04-24 12:05:16-07:00                 2018-05-09 18:14:32-07:00
year                                      2018                                      2018
month                                     4                                         5
weekday                                   Tuesday                                   Wednesday
hour                                      12                                        18
review_title_n_char                       19                                        48
review_title_n_words                      2                                         8
review_title_avg_word_length              9.50                                      6.00
review_body_n_char                        100                                       178
review_body_n_words                       15                                        29
review_body_avg_word_length               6.67                                      6.14
review_body_title_n_char_ratio            5.26                                      3.71
review_body_title_n_words_ratio           7.50                                      3.62
review_body_title_avg_word_length_ratio   0.70                                      1.02
3.5 Several Data Frames
3.5.1 Inspection: Several Data Frames
The analysis reveals that dataset “reviews” contains information on 1,523 more unique podcasts than the remaining tables. These additional podcasts will be excluded from the analyses in which information from more than one table is needed. In analyses based on “reviews” only, these data will be included.
Code
print("Number of podcasts in each dataset:")print( data_reviews.podcast_id.nunique(), data_podcasts.podcast_id.nunique(), data_categories.podcast_id.nunique(), sep=", ",)id_in_r = data_reviews.podcast_id.unique()id_in_p = data_podcasts.podcast_id.unique()id_in_c = data_categories.podcast_id.unique()print(f"""Podcast IDs:- Only in "reviews", not in "podcasts" {len(set(id_in_r).difference(id_in_p))}- Only in "reviews", not in "categories" {len(set(id_in_r).difference(id_in_c))}- Only in "podcasts", not in "reviews" {len(set(id_in_p).difference(id_in_r))}- Only in "podcasts", not in "categories" {len(set(id_in_p).difference(id_in_c))}- Only in "categories", not in "podcasts" {len(set(id_in_c).difference(id_in_p))}- Only in "categories", not in "reviews" {len(set(id_in_c).difference(id_in_r))}""")
Number of podcasts in each dataset:
111125, 109602, 109602
Podcast IDs:
- Only in "reviews", not in "podcasts" 1523
- Only in "reviews", not in "categories" 1523
- Only in "podcasts", not in "reviews" 0
- Only in "podcasts", not in "categories" 0
- Only in "categories", not in "podcasts" 0
- Only in "categories", not in "reviews" 0
3.5.2 Pre-Processing: Several Data Frames
In this section, pre-processing is performed with 3 purposes in mind:
to explore the correlation between the number of reviews and other features: this metric is calculated for each podcast, so the correlation analysis uses feature values aggregated for each podcast;
to explore the correlation between rating and other features: as this metric exists for each review (i.e., one podcast can have many reviews), no aggregation is introduced here;
to explore ratings between categories.
Pre-processing to analyze the number of reviews:
Code
gr_reviews = data_reviews.groupby("podcast_id")

# Aggregates of rating
rating_summary = gr_reviews["rating"].agg(["count", "mean"])
rating_summary = rating_summary.set_axis(["n_reviews", "mean_rating"], axis=1)

# Aggregates of other metrics
cols_other_metrics = [
    "review_title_n_char",
    "review_title_n_words",
    "review_title_avg_word_length",
    "review_body_n_char",
    "review_body_n_words",
    "review_body_avg_word_length",
    "review_body_title_n_char_ratio",
    "review_body_title_n_words_ratio",
    "review_body_title_avg_word_length_ratio",
]
metrics_per_podcast = gr_reviews[cols_other_metrics].agg(["mean"])

# Create meaningful variable names
new_index = metrics_per_podcast.columns.to_flat_index().map(
    lambda x: f"{x[1]}_{x[0]}"
)

# Merge the datasets
data_for_n_reviews_corr = my.merge_all(
    [
        rating_summary,
        metrics_per_podcast.set_axis(new_index, axis=1),
        n_categories_per_podcast.reset_index(),
        # Select ID and numeric columns from "podcasts"
        data_podcasts.set_index("podcast_id")
        .select_dtypes("number")
        .drop("itunes_id", axis=1)
        .reset_index(),
    ],
    on="podcast_id",
    how="inner",
)

# Preview
print(data_for_n_reviews_corr.shape)
data_for_n_reviews_corr.head(2)
4 Analysis
4.1 Overall Rating
Question: How are podcast ratings distributed overall? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
ratings = data_reviews["rating"]

print("Rating:")
my.display_collapsible(
    my.calc_summaries(ratings), "Descriptive statistics (overall rating)"
)
my.display_collapsible(
    [round(i, 3) for i in sms.DescrStatsW(ratings).tconfint_mean()],
    "Confidence interval of mean (overall rating)",
    sep=" – ",
)

print("Analysis of counts per rating:")
counts = ratings.value_counts().sort_index()
analyze_counts(
    counts,
    by="rating",
    title="Overall Rating",
    rot=0,
);
Rating:
Analysis of counts per rating:
Descriptive statistics (overall rating):
        count    min   max   range  mean  median  std   mad   skew
rating  2024681  1.00  5.00  4.00   4.63  5.00    1.04  0.00  -2.78
Confidence interval of mean (overall rating): 4.629 – 4.631
Omnibus (chi-squared) test results: Chi-squared test, χ²(4, n = 2024681) = 5633146.82, p < 0.001
Review counts with 95% CI and post-hoc (pairwise chi-squared) test results:
   rating  n_reviews  percent  ci_lower  ci_upper  cld  spaced_cld
0  1       113,197    5.6%     5.5%      5.6%      a    a____
1  2       44,517     2.2%     2.2%      2.2%      b    _b___
2  3       50,670     2.5%     2.5%      2.5%      c    __c__
3  4       61,360     3.0%     3.0%      3.1%      d    ___d_
4  5       1,754,937  86.7%    86.6%     86.7%     e    ____e
Descriptive statistics of group (rating) counts:
           count  min     max        range      mean     median  std      mad     skew
n_reviews  5      44,517  1,754,937  1,710,420  404,936  61,360  755,160  16,843  2.23
percent    5      2.2%    86.7%      84.5%      20.0%    3.0%    37.3%    0.8%    2.23
Results. Rating scores range from 1 (worst) to 5 (best) points. The distribution of rating values is highly skewed with mean overall rating equal to 4.630 (CI 4.629–4.631). The median of ratings is 5 points and the dominant rating (which appeared in 86.7% of reviews) is also 5 points. All the remaining rating values are much less frequent (the differences in proportions between all five rating groups are significant).
4.2 Podcast Categories
Question: Which number of categories per podcast is most common? Population: Podcasts on the Apple Podcasts platform.
Code
n_podcasts_per_n_main_categories = (
    n_categories_per_podcast["n_categories"].value_counts().sort_index()
)
analyze_counts(
    n_podcasts_per_n_main_categories,
    "n_categories",
    counts_of="podcast",
    xlabel="Number of categories per podcast",
    title="Popularity of Categories' Count per Podcast",
    rot=0,
);
Omnibus (chi-squared) test results: Chi-squared test, χ²(4, n = 109602) = 219804.47, p < 0.001
Podcast counts with 95% CI and post-hoc (pairwise chi-squared) test results:
   n_categories  n_podcasts  percent  ci_lower  ci_upper  cld  spaced_cld
0  1             80,188      73.2%    72.8%     73.5%     a    a____
1  2             28,090      25.6%    25.3%     26.0%     b    _b___
2  3             1,246       1.1%     1.1%      1.2%      c    __c__
3  4             75          0.1%     0.1%      0.1%      d    ___d_
4  5             3           <0.1%    <0.1%     <0.1%     e    ____e
Descriptive statistics of group (n_categories) counts:
            count  min    max     range   mean    median  std     mad    skew
n_podcasts  5      3      80,188  80,185  21,920  1,246   34,707  1,243  1.68
percent     5      <0.1%  73.2%   73.2%   20.0%   1.1%    31.7%   1.1%   1.68
Results. Podcasts have up to 5 main categories. Differences in proportions between all the groups (of the number of categories per podcast) are statistically significant. Almost three-quarters of podcasts (n = 80,188; 73.2%, CI 72.8%–73.5%) have only one category, which is the most common choice, and slightly more than one-quarter (n = 28,090; 25.6%, CI 25.3%–26.0%) have 2 categories.
Question: Which number of categories per podcast is best rated? Population: Podcasts on the Apple Podcasts platform.
Code
analyze_numeric_groups("mean_rating", by="n_categories", data=data_for_n_reviews_corr, xlabel="Number of categories per podcast", ylabel="Mean rating", title="Rating of Categories' Count per Podcast",);
Omnibus (Kruskal-Wallis) test results:
         Source        ddof1  H       p-unc
Kruskal  n_categories  4      309.54  p < 0.001
Post-hoc (Conover-Iman) test results as CLD and confidence intervals (CI) of ratings:
   n_categories  cld  spaced_cld  mean  ci_lower  ci_upper
0  1             a    a__         4.80  4.80      4.81
1  2             b    _b_         4.83  4.82      4.83
2  3             c    __c         4.84  4.82      4.86
3  4             bc   _bc         4.73  4.58      4.87
4  5             abc  abc         4.87  4.47      5.27
Descriptive statistics of group (n_categories) means:
      count  min   max   range  mean  median  std   mad   skew
mean  5      4.73  4.87  0.14   4.81  4.83    0.05  0.02  -1.21
Note. Even though for 5 categories per podcast the upper bound of the CI is above 5 (the CI calculation method does not know the range of possible rating values), the plot shows ratings only up to 5.
Results. A change from 1 to 3 categories per podcast shows a small increasing trend (from rating 4.80 to 4.84). For 4 and 5 categories per podcast the errors are large due to the small group sizes, so no trends are visible there.
Question: Which number of sub-categories per podcast is the most common? Population: Podcasts on the Apple Podcasts platform.
Code
n_podcasts_per_n_sub_categories = (
    n_categories_per_podcast["n_subcategories"].value_counts().sort_index()
)
analyze_counts(
    n_podcasts_per_n_sub_categories,
    "n_subcategories",
    counts_of="podcast",
    xlabel="Number of sub-categories per podcast",
    ylabel="Number of podcasts",
    title="Popularity of Sub-Categories' Count per Podcast",
    rot=0,
);
Omnibus (chi-squared) test results: Chi-squared test, χ²(7, n = 109602) = 171282.29, p < 0.001
Podcast counts with 95% CI and post-hoc (pairwise chi-squared) test results:
   n_subcategories  n_podcasts  percent  ci_lower  ci_upper  cld  spaced_cld
0  1                44,145      40.3%    39.9%     40.7%     a    a_______
1  2                39,373      35.9%    35.5%     36.3%     b    _b______
2  3                16,473      15.0%    14.7%     15.3%     c    __c_____
3  4                8,796       8.0%     7.8%      8.3%      d    ___d____
4  5                648         0.6%     0.5%      0.7%      e    ____e___
5  6                146         0.1%     0.1%      0.2%      f    _____f__
6  7                18          <0.1%    <0.1%     <0.1%     g    ______g_
7  8                3           <0.1%    <0.1%     <0.1%     h    _______h
Descriptive statistics of group (n_subcategories) counts:
            count  min    max     range   mean    median  std     mad    skew
n_podcasts  8      3      44,145  44,142  13,700  4,722   18,309  4,712  1.09
percent     8      <0.1%  40.3%   40.3%   12.5%   4.3%    16.7%   4.3%   1.09
Results. Differences between all groups are significant. Most common are podcasts with a single subcategory (40.3%), but two-subcategory podcasts are almost equally common (35.9%).
We can also notice that single-category podcasts in the sample have up to 4 sub-categories.
Question: Which categories are the most popular among Apple Podcasts’ creators? Population: Podcast categories on the Apple Podcasts platform.
Code
n_podcasts_per_category = data_categories["category"].value_counts()
res_counts_category = analyze_counts(
    n_podcasts_per_category,
    by="category",
    counts_of="podcast",
    ylabel="Number of podcasts *",
    title="Popularity of Categories",
    rot=90,
    label_rotation=90,
    y_lim_max=34900,
)
plt.figtext(
    0.2,
    -0.5,
    "* The same podcast can have several categories",
    wrap=True,
    horizontalalignment="center",
    fontsize=8,
);
Omnibus (chi-squared) test results: Chi-squared test, χ²(18, n = 211760) = 109953.36, p < 0.001
Podcast counts with 95% CI and post-hoc (pairwise chi-squared) test results:
    category               n_podcasts  percent  ci_lower  ci_upper  cld  spaced_cld
0   society/culture        27,536      13.0%    12.8%     13.2%     a    a_______________
1   religion/spirituality  23,409      11.1%    10.9%     11.3%     b    _b______________
2   business               22,390      10.6%    10.4%     10.8%     c    __c_____________
3   education              20,700      9.8%     9.6%      10.0%     d    ___d____________
4   arts                   16,099      7.6%     7.4%      7.8%      e    ____e___________
5   comedy                 14,548      6.9%     6.7%      7.0%      f    _____f__________
6   health/fitness         14,504      6.8%     6.7%      7.0%      f    _____f__________
7   sports                 13,138      6.2%     6.0%      6.4%      g    ______g_________
8   leisure                12,023      5.7%     5.5%      5.8%      h    _______h________
9   tv/film                10,794      5.1%     5.0%      5.2%      i    ________i_______
10  news                   10,755      5.1%     4.9%      5.2%      i    ________i_______
11  music                  7,807       3.7%     3.6%      3.8%      j    _________j______
12  kids/family            4,843       2.3%     2.2%      2.4%      k    __________k_____
13  science                3,375       1.6%     1.5%      1.7%      l    ___________l____
14  technology             3,227       1.5%     1.4%      1.6%      l    ___________l____
15  fiction                2,762       1.3%     1.2%      1.4%      m    ____________m___
16  history                1,657       0.8%     0.7%      0.8%      n    _____________n__
17  crime                  1,255       0.6%     0.5%      0.6%      o    ______________o_
18  government             938         0.4%     0.4%      0.5%      p    _______________p
Descriptive statistics of group (category) counts:
            count  min   max     range   mean    median  std    mad    skew
n_podcasts  19     938   27,536  26,598  11,145  10,794  8,251  7,419  0.44
percent     19     0.4%  13.0%   12.6%   5.3%    5.1%    3.9%   3.5%   0.44
Results. The chi-squared goodness-of-fit test shows significant differences in the popularity of categories among podcast creators (p < 0.001), so post-hoc pair-wise comparisons were performed. Category “society/culture” (13.0%, 95% CI 12.8%–13.2%) was the most popular, followed by “religion/spirituality” (11.1%, 95% CI 10.9%–11.3%), “business” (10.6%, 95% CI 10.4%–10.8%), and “education” (9.8%, 95% CI 9.6%–10.0%). Differences were significant between all the groups except “comedy” and “health/fitness”, “tv/film” and “news”, as well as “science” and “technology”.
Question: Which categories are best rated? Population: Podcast categories on the Apple Podcasts platform.
Code
res_rating_category = analyze_numeric_groups(
    "mean_rating",
    by="category",
    data=data_categories_reviews_ratings,
    ylabel="Mean rating",
    title="Rating of Categories",
)
plt.figtext(
    0.2,
    -0.5,
    "* The same podcast can have several categories",
    wrap=True,
    horizontalalignment="center",
    fontsize=8,
)
plt.xticks(rotation=90);
Omnibus (Kruskal-Wallis) test results:
         Source    ddof1  H        p-unc
Kruskal  category  18     3039.94  p < 0.001
Post-hoc (Conover-Iman) test results as CLD and confidence intervals (CI) of ratings:
    category               cld  spaced_cld   mean  ci_lower  ci_upper
0   business               a    a__________  4.89  4.88      4.90
1   religion/spirituality  a    a__________  4.88  4.87      4.89
2   health/fitness         bc   _bc________  4.86  4.85      4.87
3   education              b    _b_________  4.84  4.83      4.85
4   comedy                 cd   __cd_______  4.84  4.83      4.85
5   society/culture        cd   __cd_______  4.83  4.82      4.84
6   kids/family            de   ___de______  4.83  4.81      4.85
7   music                  bc   _bc________  4.83  4.81      4.84
8   fiction                efg  ____efg____  4.80  4.78      4.83
9   leisure                ef   ____ef_____  4.80  4.79      4.81
10  arts                   ef   ____ef_____  4.79  4.78      4.80
11  sports                 f    _____f_____  4.78  4.77      4.79
12  tv/film                h    _______h___  4.75  4.74      4.77
13  technology             fg   _____fg____  4.75  4.72      4.77
14  science                hi   _______hi__  4.71  4.68      4.74
15  history                j    _________j_  4.69  4.66      4.72
16  government             ghi  ______ghi__  4.67  4.63      4.72
17  news                   i    ________i__  4.66  4.64      4.67
18  crime                  k    __________k  4.49  4.44      4.53
Descriptive statistics of group (category) means:
      count  min   max   range  mean  median  std   mad   skew
mean  19     4.49  4.89  0.41   4.77  4.80    0.10  0.05  -1.47
Results. “Business” (mean rating 4.89, CI 4.88–4.90) and “religion/spirituality” (4.88, CI 4.87–4.89) are the best-rated categories (no statistically significant difference between them). They differ significantly from the other categories, but the difference from the next best categories is tiny: it appears only at the second decimal place of the rating value. The smallest average rating (4.49, CI 4.44–4.53) is in the “crime” category.
Question: Is there a relationship between the number of reviews and the average rating grouped by category?
Code
plot_scatter(
    res_counts_category,
    res_rating_category,
    by="category",
    corr="Spearman",
    counts_of="podcast",
)
plt.xlabel("Number of Podcasts per Category")
plt.title("Relationship Between Metrics \nAggregated by Category");
Spearman's correlation: r = 0.77, p < 0.001
Results. Average category rating and the number of podcasts per category are positively correlated (Spearman r = 0.77, p < 0.001).
4.3 Temporal Trends and Patterns
The dataset contains reviews created between 2005-12-09 and 2022-12-08, which spans approximately 17 years. In this sub-section, we investigate whether there are any patterns or correlations in ratings and the number of reviews when these metrics are aggregated by time periods of various granularity (year, month, day of week, and hour).
Code
my.display_collapsible(
    my.count_unique(data_reviews), "Counts of unique values by variable"
)
print("Dates of newest and oldest review:")
my.as_df(
    data_reviews["created_at"].dt.date.agg(["min", "max"]),
    "date_type",
    "value",
)
Counts of unique values by variable:
variable                                  n_unique  percent_unique
podcast_id                                111125    5.5%
review_title                              1117416   55.2%
content                                   2008001   99.2%
rating                                    5         <0.1%
author_id                                 1451063   71.7%
created_at                                2012360   99.4%
year                                      18        <0.1%
month                                     12        <0.1%
weekday                                   7         <0.1%
hour                                      24        <0.1%
review_title_n_char                       101       <0.1%
review_title_n_words                      27        <0.1%
review_title_avg_word_length              767       <0.1%
review_body_n_char                        3595      0.2%
review_body_n_words                       888       <0.1%
review_body_avg_word_length               37487     1.9%
review_body_title_n_char_ratio            53313     2.6%
review_body_title_n_words_ratio           4725      0.2%
review_body_title_avg_word_length_ratio   297529    14.7%
Dates of newest and oldest review:
   date_type  value
0  min        2005-12-09
1  max        2022-12-08
4.3.1 Reviews and Ratings by Year
Question: Does the number of reviews change from year to year? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
counts = data_reviews["year"].value_counts().sort_index()res_counts_y = analyze_counts( counts, by="year", title="Amount of Reviews by Year", rot=90, label_rotation=90, color=["grey", *[blue for i inrange(16)], "grey"], y_lim_max=549000,)plt.gca().legend( ["Not full year data (2005 and 2022)"], frameon=False, loc="upper left");
Omnibus (chi-squared) test results: Chi-squared test, χ²(17, n = 2024681) = 3097405.0, p < 0.001
Review counts with 95% CI and post-hoc (pairwise chi-squared) test results:
    year  n_reviews  percent  ci_lower  ci_upper  cld  spaced_cld
0   2005  208        <0.1%    <0.1%     <0.1%     a    a_________________
1   2006  3,378      0.2%     0.2%      0.2%      b    _b________________
2   2007  5,917      0.3%     0.3%      0.3%      c    __c_______________
3   2008  8,502      0.4%     0.4%      0.4%      d    ___d______________
4   2009  11,173     0.6%     0.5%      0.6%      e    ____e_____________
5   2010  12,880     0.6%     0.6%      0.7%      f    _____f____________
6   2011  17,631     0.9%     0.9%      0.9%      g    ______g___________
7   2012  26,233     1.3%     1.3%      1.3%      h    _______h__________
8   2013  25,573     1.3%     1.2%      1.3%      i    ________i_________
9   2014  36,498     1.8%     1.8%      1.8%      j    _________j________
10  2015  61,825     3.1%     3.0%      3.1%      k    __________k_______
11  2016  86,426     4.3%     4.2%      4.3%      l    ___________l______
12  2017  141,448    7.0%     6.9%      7.0%      m    ____________m_____
13  2018  202,969    10.0%    10.0%     10.1%     n    _____________n____
14  2019  355,218    17.5%    17.5%     17.6%     o    ______________o___
15  2020  434,540    21.5%    21.4%     21.5%     p    _______________p__
16  2021  362,991    17.9%    17.8%     18.0%     q    ________________q_
17  2022  231,271    11.4%    11.4%     11.5%     r    _________________r
Descriptive statistics of group (year) counts:
           count  min    max      range    mean     median  std      mad     skew
n_reviews  18     208    434,540  434,332  112,482  31,366  143,158  29,224  1.26
percent    18     <0.1%  21.5%    21.5%    5.6%     1.5%    7.1%     1.4%    1.26
Results. Test results show that the differences in the number of reviews by year are significant. From 2005 to 2020 the number of reviews was increasing, reaching a peak of 434,540 reviews in 2020; after that a decreasing trend is visible.
We cannot draw firm conclusions about 2005 and 2022, as the data for these years do not cover the full year.
Question: Do the ratings change from year to year? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
# Average rating
res_ratings_y = analyze_ratings(by="year", title="Average Rating by Year")
my.ax_xaxis_integer_ticks(min_n_ticks=8, rot=90)
Omnibus (Kruskal-Wallis) test results:
         Source  ddof1  H         p-unc
Kruskal  year    17     14078.86  p < 0.001
Post-hoc (Conover-Iman) test results as CLD and confidence intervals (CI) of ratings:
    year  cld       spaced_cld   mean  ci_lower  ci_upper
0   2005  abcdefgh  abcdefgh___  4.80  4.70      4.89
1   2006  ij        ________ij_  4.58  4.55      4.62
2   2007  i         ________i__  4.53  4.51      4.56
3   2008  jk        _________jk  4.59  4.57      4.61
4   2009  j         _________j_  4.57  4.55      4.59
5   2010  ak        a_________k  4.58  4.56      4.60
6   2011  b         _b_________  4.62  4.61      4.64
7   2012  c         __c________  4.70  4.69      4.71
8   2013  c         __c________  4.72  4.71      4.73
9   2014  de        ___de______  4.80  4.79      4.81
10  2015  d         ___d_______  4.81  4.81      4.82
11  2016  ef        ____ef_____  4.78  4.77      4.78
12  2017  f         _____f_____  4.77  4.76      4.77
13  2018  g         ______g____  4.74  4.73      4.74
14  2019  h         _______h___  4.62  4.61      4.62
15  2020  h         _______h___  4.60  4.60      4.61
16  2021  b         _b_________  4.56  4.56      4.57
17  2022  ajk       a________jk  4.49  4.49      4.50
Descriptive statistics of group (year) means:
      count  min   max   range  mean  median  std   mad   skew
mean  18     4.49  4.81  0.32   4.66  4.62    0.10  0.08  0.17
Results. Estimates of yearly mean ratings range from 4.49 to 4.81. Excluding the year 2005, when less than a month of data was included, the average rating was gradually increasing until it reached its top estimate of 4.81 in 2015 (the difference is not significant compared to 2014, but significant compared to all other years). From 2015 till 2022 a decreasing trend is visible. Yet, no firm conclusions should be drawn about the year 2022, as the last 3 weeks of December are not yet included.
Question: Is there a relationship between the number of reviews and the average rating grouped by year?
Code
plot_scatter(
    res_counts_y.query("'2006' <= year <= '2021'"),
    res_ratings_y.query("'2006' <= year <= '2021'"),
    by="year",
    corr="Spearman",
)
Spearman's correlation: r = 0.39, p = 0.141
Results. No significant monotonic relationship was detected between the number of reviews and average rating grouped by years. Yet the plot indicates that there might be a non-linear non-monotonic relationship.
4.3.2 Reviews and Ratings by Month of Year
Question: Does the number of reviews change from month to month? Is an annual pattern present? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
data_reviews_2006_2021 = data_reviews.query("2006 <= year <= 2021")
counts = data_reviews_2006_2021["month"].value_counts().sort_index()
res_counts_m = analyze_counts(
    counts,
    by="month",
    title="Number of Reviews by Month (2006-2021)",
    rot=0,
    label_rotation=90,
    y_lim_max=205000,
)
Omnibus (chi-squared) test results: Chi-squared test, χ²(11, n = 1793202) = 4541.78, p < 0.001
Review counts with 95% CI and post-hoc (pairwise chi-squared) test results:
    month  n_reviews  percent  ci_lower  ci_upper  cld  spaced_cld
0   1      144,131    8.0%     8.0%      8.1%      ab   ab_______
1   2      143,178    8.0%     7.9%      8.0%      a    a________
2   3      145,510    8.1%     8.1%      8.2%      bc   _bc______
3   4      145,808    8.1%     8.1%      8.2%      c    __c______
4   5      148,403    8.3%     8.2%      8.3%      d    ___d_____
5   6      148,644    8.3%     8.2%      8.3%      d    ___d_____
6   7      152,979    8.5%     8.5%      8.6%      e    ____e____
7   8      157,771    8.8%     8.7%      8.9%      f    _____f___
8   9      160,106    8.9%     8.9%      9.0%      g    ______g__
9   10     164,308    9.2%     9.1%      9.2%      h    _______h_
10  11     144,911    8.1%     8.0%      8.1%      bc   _bc______
11  12     137,453    7.7%     7.6%      7.7%      i    ________i
Descriptive statistics of group (month) counts:
           count  min      max      range   mean     median   std    mad    skew
n_reviews  12     137,453  164,308  26,855  149,434  147,106  7,855  3,451  0.63
percent    12     7.7%     9.2%     1.5%    8.3%     8.2%     0.4%   0.2%   0.63
Results. Comparing the number of reviews by month, a very small increase can be noticed in the first half of the year (from 8.0% to 8.3% of all reviews), followed by a slightly steeper increase from the beginning of summer (8.3%) to October (9.2%). The biggest drop (to 7.7%) is in December. The difference between October and December is statistically significant.
Question: Are there any differences in ratings comparing different months of the year? Is an annual pattern present? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
res_ratings_m = analyze_ratings(
    by="month",
    title="Average Rating by Month (2006-2021)",
    data=data_reviews_2006_2021,
)
my.ax_xaxis_integer_ticks(min_n_ticks=12)
Omnibus (Kruskal-Wallis) test results:
         Source  ddof1  H       p-unc
Kruskal  month   11     341.80  p < 0.001
Post-hoc (Conover-Iman) test results as CLD and confidence intervals (CI) of ratings:
    month  cld  spaced_cld  mean  ci_lower  ci_upper
0   1      ab   ab____      4.66  4.66      4.67
1   2      c    __c___      4.67  4.67      4.68
2   3      ad   a__d__      4.65  4.64      4.65
3   4      bc   _bc___      4.66  4.66      4.67
4   5      ab   ab____      4.66  4.65      4.66
5   6      ef   ____ef      4.63  4.63      4.64
6   7      ad   a__d__      4.65  4.65      4.66
7   8      de   ___de_      4.64  4.64      4.65
8   9      de   ___de_      4.64  4.64      4.65
9   10     de   ___de_      4.64  4.64      4.65
10  11     f    _____f      4.63  4.62      4.63
11  12     f    _____f      4.63  4.62      4.63
Descriptive statistics of group (month) means:
      count  min   max   range  mean  median  std   mad   skew
mean  12     4.63  4.67  0.05   4.65  4.65    0.01  0.01  0.09
Results. The ratings (although the differences between some months are statistically significant) vary in a narrow range of about 0.05 points.
Question: Is there a relationship between the number of reviews and the average rating grouped by month?
Results. No significant relationship was detected between the number of reviews and the average rating grouped by month.
4.3.3 Reviews and Ratings by Day of Week
Question: Is there a difference in the number of reviews between days of the week? Is a weekly pattern present? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
counts = data_reviews["weekday"].value_counts().sort_index()res_counts_d = analyze_counts( counts, by="weekday", xlabel="Day of Week", title="Amount of Reviews by Day of Week",)
Omnibus (chi-squared) test results: Chi-squared test, χ²(6, n = 2024681) = 73683.67, p < 0.001
Review counts with 95% CI and post-hoc (pairwise chi-squared) test results:
   weekday    n_reviews  percent  ci_lower  ci_upper  cld  spaced_cld
0  Monday     301,345    14.9%    14.8%     15.0%     a    a_____
1  Tuesday    340,186    16.8%    16.7%     16.9%     b    _b____
2  Wednesday  344,212    17.0%    16.9%     17.1%     c    __c___
3  Thursday   332,548    16.4%    16.4%     16.5%     d    ___d__
4  Friday     293,180    14.5%    14.4%     14.5%     e    ____e_
5  Saturday   206,866    10.2%    10.2%     10.3%     f    _____f
6  Sunday     206,344    10.2%    10.1%     10.2%     f    _____f
Descriptive statistics of group (weekday) counts:
           count  min      max      range    mean     median   std     mad     skew
n_reviews  7      206,344  344,212  137,868  289,240  301,345  59,599  38,841  -0.82
percent    7      10.2%    17.0%    6.8%     14.3%    14.9%    2.9%    1.9%    -0.82
Results. The analysis of weekly patterns revealed that most reviews are created in the middle of the week (e.g., on Wednesday on average 17.0% CI 16.9-17.1% of weekly reviews are created) and least on weekends (10.2% of weekly reviews on each day). Differences are significant between all days except Sunday and Saturday.
Question: Are there any differences in ratings comparing different days of the week? Is a weekly pattern present? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
res_ratings_d = analyze_ratings(
    by="weekday", title="Average Rating by Day of Week", xlabel="Day of Week"
)
Omnibus (Kruskal-Wallis) test results:
         Source   ddof1  H       p-unc
Kruskal  weekday  6      222.02  p < 0.001
Post-hoc (Conover-Iman) test results as CLD and confidence intervals (CI) of ratings:
   weekday    cld  spaced_cld  mean  ci_lower  ci_upper
0  Monday     a    a___        4.62  4.62      4.63
1  Tuesday    b    _b__        4.63  4.63      4.64
2  Wednesday  c    __c_        4.64  4.64      4.65
3  Thursday   b    _b__        4.63  4.63      4.64
4  Friday     bc   _bc_        4.64  4.63      4.64
5  Saturday   a    a___        4.62  4.61      4.62
6  Sunday     d    ___d        4.61  4.60      4.61
Descriptive statistics of group (weekday) means:
      count  min   max   range  mean  median  std   mad   skew
mean  7      4.61  4.64  0.04   4.63  4.63    0.01  0.01  -0.82
Results. The values of the ratings vary in a very narrow range of 0.04 points. Even though differences between some days are significant, they are practically negligible.
Question: Is there a relationship between the number of reviews and the average rating grouped by the day of the week?
Results. A significant positive relationship between the number of reviews and average rating grouped by weekday is detected (Spearman’s r = 0.79, p = 0.036). This shows that a bigger quantity of reviews is related to a better average rating.
4.3.4 Reviews and Ratings by Hour of Day
Question: Is there a difference in the number of reviews between hours of the day? Is a daily pattern present? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
counts = data_reviews["hour"].value_counts().sort_index()res_counts_h = analyze_counts( counts, by="hour", xlabel="Hour of Day", title="Amount of Reviews by Hour", rot=0, label_rotation=90, y_lim_max=155000,)
Omnibus (chi-squared) test results: Chi-squared test, χ²(23, n = 2024681) = 399296.48, p < 0.001
Review counts with 95% CI and post-hoc (pairwise chi-squared) test results:
    hour  n_reviews  percent  ci_lower  ci_upper  cld  spaced_cld
0   0     22,813     1.1%     1.1%      1.1%      a    a_________________
1   1     17,536     0.9%     0.8%      0.9%      b    _b________________
2   2     18,421     0.9%     0.9%      0.9%      c    __c_______________
3   3     29,007     1.4%     1.4%      1.5%      d    ___d______________
4   4     48,771     2.4%     2.4%      2.4%      e    ____e_____________
5   5     73,306     3.6%     3.6%      3.7%      f    _____f____________
6   6     97,458     4.8%     4.8%      4.9%      g    ______g___________
7   7     115,387    5.7%     5.6%      5.7%      h    _______h__________
8   8     125,262    6.2%     6.1%      6.2%      ij   ________ij________
9   9     126,314    6.2%     6.2%      6.3%      i    ________i_________
10  10    126,211    6.2%     6.2%      6.3%      i    ________i_________
11  11    124,544    6.2%     6.1%      6.2%      j    _________j________
12  12    119,505    5.9%     5.9%      6.0%      k    __________k_______
13  13    115,508    5.7%     5.7%      5.8%      h    _______h__________
14  14    109,907    5.4%     5.4%      5.5%      l    ___________l______
15  15    103,966    5.1%     5.1%      5.2%      m    ____________m_____
16  16    101,606    5.0%     5.0%      5.1%      n    _____________n____
17  17    102,698    5.1%     5.0%      5.1%      mn   ____________mn____
18  18    106,078    5.2%     5.2%      5.3%      o    ______________o___
19  19    101,542    5.0%     5.0%      5.1%      n    _____________n____
20  20    87,258     4.3%     4.3%      4.4%      p    _______________p__
21  21    68,585     3.4%     3.3%      3.4%      q    ________________q_
22  22    49,347     2.4%     2.4%      2.5%      e    ____e_____________
23  23    33,651     1.7%     1.6%      1.7%      r    _________________r
Descriptive statistics of group (hour) counts:
           count  min     max      range    mean    median   std     mad     skew
n_reviews  24     17,536  126,314  108,778  84,362  101,574  38,270  23,329  -0.66
percent    24     0.9%    6.2%     5.4%     4.2%    5.0%     1.9%    1.2%    -0.66
Results. Daily patterns in the number of reviews are very clear: there are two peaks when users are most active (a bigger one at 8–11 o’clock, with 6.2% of reviews each hour, and a lower one at around 18 o’clock, with 5.2% of reviews) and one trough (at around 1–2 AM, with 0.9% of reviews each hour) when users are least active.
The time zone is UTC−07:00, so 8–11 AM in UTC−07:00 corresponds to evening hours in some European countries. We do not have reviewers’ location data, so we can only assume that part of this activity comes from Europe.
Question: Are there any differences in ratings comparing different hours of the day? Is a daily pattern present? Population: Reviews of podcasts on the Apple Podcasts platform.
Code
res_ratings_h = analyze_ratings(by="hour", title="Average Rating by Hour")
my.ax_xaxis_integer_ticks(min_n_ticks=24)
plt.gca().set_xlim([-0.5, 23.5]);
Omnibus (Kruskal-Wallis) test results:
         Source  ddof1  H        p-unc
Kruskal  hour    23     3321.07  p < 0.001
Post-hoc (Conover-Iman) test results as CLD and confidence intervals (CI) of ratings:
    hour  cld  spaced_cld   mean  ci_lower  ci_upper
0   0     ab   ab_________  4.49  4.47      4.50
1   1     a    a__________  4.46  4.44      4.48
2   2     ab   ab_________  4.49  4.47      4.50
3   3     b    _b_________  4.51  4.50      4.53
4   4     c    __c________  4.54  4.53      4.55
5   5     d    ___d_______  4.58  4.57      4.59
6   6     e    ____e______  4.61  4.60      4.61
7   7     f    _____f_____  4.63  4.62      4.63
8   8     fg   _____fg____  4.64  4.63      4.64
9   9     fgh  _____fgh___  4.64  4.64      4.65
10  10    gh   ______gh___  4.65  4.64      4.65
11  11    gh   ______gh___  4.64  4.64      4.65
12  12    gh   ______gh___  4.64  4.64      4.65
13  13    gh   ______gh___  4.64  4.63      4.65
14  14    fgh  _____fgh___  4.64  4.63      4.64
15  15    hi   _______hi__  4.65  4.64      4.66
16  16    ij   ________ij_  4.66  4.66      4.67
17  17    k    __________k  4.68  4.67      4.68
18  18    jk   _________jk  4.67  4.67      4.68
19  19    ij   ________ij_  4.67  4.66      4.67
20  20    gh   ______gh___  4.64  4.64      4.65
21  21    e    ____e______  4.61  4.60      4.61
22  22    d    ___d_______  4.58  4.57      4.59
23  23    c    __c________  4.55  4.54      4.56
Descriptive statistics of group (hour) means:
      count  min   max   range  mean  median  std   mad   skew
mean  24     4.46  4.68  0.22   4.60  4.64    0.06  0.03  -1.01
Results. A pattern in ratings by hour of day is also visible, and it reflects the trend in the number of reviews: during the daytime the ratings tend to be higher, and at night they are lowest.
Question: Is there a relationship between the number of reviews and average rating grouped by hour of day?
Results. A significant positive relationship between the number of reviews and average rating grouped by hour of the day is detected (Spearman’s r = 0.69, p < 0.001). This shows that a bigger quantity of reviews is related to a better average rating.
4.4 Correlation between Number of Reviews, Rating and Other Features
Question: Does the rating of reviews correlate to other features (number of categories/sub-categories per podcast, various podcast title, review title and body length metrics)? Population: Reviews of podcasts on the Apple Podcasts platform.
Spearman correlation analysis is performed.
Code
# Correlation between rating and other features:
corr_result = do_corr_1_vs_other(data_for_rating_corr, var_x="rating")

# Correlation between number of reviews and other features:
corr_result = do_corr_1_vs_other(
    data_for_n_reviews_corr, var_x="n_reviews", corr_result=corr_result
)

# To data frame
corr_result = pd.DataFrame(corr_result)

# Sort by correlation strength
corr_result = corr_result.sort_values(by="r", key=abs, ascending=False)

# Control for type I error: add column for adjusted p values
corr_result["p_adj"] = [
    my.format_p(i)
    for i in sms.multipletests(corr_result["p"], method="holm")[1]
]
Correlation between rating and other features:
Code
corr_result.query("variable_1=='rating'")
    variable_1  variable_2                                   r     p  p_adj
13  rating      review_body_n_words                      -0.09  0.00  p < 0.001
12  rating      review_body_n_char                       -0.09  0.00  p < 0.001
15  rating      review_body_title_n_char_ratio           -0.09  0.00  p < 0.001
16  rating      review_body_title_n_words_ratio          -0.07  0.00  p < 0.001
11  rating      review_title_avg_word_length              0.05  0.00  p < 0.001
 8  rating      n_subcategories                           0.05  0.00  p < 0.001
17  rating      review_body_title_avg_word_length_ratio  -0.04  0.00  p < 0.001
 4  rating      podcast_title_n_char                      0.03  0.00  p < 0.001
 2  rating      slug_n_char_wo_hyphen                     0.03  0.00  p < 0.001
 0  rating      slug_n_char                               0.03  0.00  p < 0.001
10  rating      review_title_n_words                     -0.03  0.00  p < 0.001
 5  rating      podcast_title_n_words                     0.03  0.00  p < 0.001
 1  rating      slug_n_hyphen                             0.03  0.00  p < 0.001
14  rating      review_body_avg_word_length               0.03  0.00  p < 0.001
 6  rating      podcast_title_avg_word_length             0.02  0.00  p < 0.001
 9  rating      review_title_n_char                      -0.01  0.00  p < 0.001
 3  rating      slug_avg_word_length                      0.01  0.00  p < 0.001
 7  rating      n_categories                             -0.00  0.00  p < 0.001
Results: Although statistically significant, the relationships between ratings and the other investigated variables are weak (|r| < 0.10). The strongest are between:
rating and review length (e.g., length in words, review_body_n_words), r = -0.09 (longer reviews tend to have slightly worse ratings);
rating and the average word length of the review title (review_title_avg_word_length), r = 0.05 (longer words in review titles are related to marginally better ratings); and
rating and the number of sub-categories, r = 0.05 (more sub-categories are related to marginally better ratings).
Question: Does the number of reviews correlate with other features (mean podcast rating, number of categories/sub-categories per podcast, and various podcast title, review title, and review body length metrics)? Population: Podcasts on the Apple Podcasts platform.
Correlation between number of reviews and other features:
Code
corr_result.query("variable_1=='n_reviews'")
    variable_1  variable_2                                  r     p  p_adj
18  n_reviews   mean_rating                             -0.40  0.00  p < 0.001
25  n_reviews   mean_review_body_title_n_char_ratio      0.33  0.00  p < 0.001
23  n_reviews   mean_review_body_n_words                 0.32  0.00  p < 0.001
26  n_reviews   mean_review_body_title_n_words_ratio     0.32  0.00  p < 0.001
22  n_reviews   mean_review_body_n_char                  0.32  0.00  p < 0.001
20  n_reviews   mean_review_title_n_words                0.20  0.00  p < 0.001
28  n_reviews   n_categories                             0.17  0.00  p < 0.001
19  n_reviews   mean_review_title_n_char                 0.17  0.00  p < 0.001
29  n_reviews   n_subcategories                          0.14  0.00  p < 0.001
31  n_reviews   slug_n_hyphen                            0.06  0.00  p < 0.001
27  n_reviews   mean_review_body_title_avg_word_leng...  0.06  0.00  p < 0.001
35  n_reviews   podcast_title_n_words                    0.06  0.00  p < 0.001
34  n_reviews   podcast_title_n_char                     0.05  0.00  p < 0.001
21  n_reviews   mean_review_title_avg_word_length        0.05  0.00  p < 0.001
30  n_reviews   slug_n_char                              0.05  0.00  p < 0.001
32  n_reviews   slug_n_char_wo_hyphen                    0.05  0.00  p < 0.001
33  n_reviews   slug_avg_word_length                    -0.03  0.00  p < 0.001
24  n_reviews   mean_review_body_avg_word_length         0.02  0.00  p < 0.001
36  n_reviews   podcast_title_avg_word_length           -0.01  0.00  p < 0.001
As the correlation between n_reviews and mean_rating (r = -0.40) looks suspicious (it contradicts the results seen before), these variables are plotted for further investigation. Ranks rather than raw values are plotted because Spearman's correlation operates on ranks:
Code
# Data
vars = ["n_reviews", "mean_rating"]
ranks_all = data_for_n_reviews_corr[vars].rank()
ranks_wo_rating_5 = (
    data_for_n_reviews_corr[vars].query("mean_rating < 5").rank()
)

# Correlation
res = do_corr_1_vs_other(ranks_all, "n_reviews")
res = do_corr_1_vs_other(ranks_wo_rating_5, "n_reviews", res)
res = pd.DataFrame(res)
res.insert(2, "subset", ["All ratings", "Rating < 5"])
display(res)

# Plot
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 4))

# Subplot 1
sns.regplot(
    x="n_reviews",
    y="mean_rating",
    data=ranks_all,
    scatter_kws={"alpha": 0.01},
    line_kws={"color": "black"},
    ax=ax1,
)
ax1.set_xlabel("Rank of number of reviews")
ax1.set_ylabel("Rank of average rating")
ax1.set_title("All ratings")
my.ax_axis_comma_format(ax=ax1)

# Subplot 2
sns.regplot(
    x="n_reviews",
    y="mean_rating",
    data=ranks_wo_rating_5,
    scatter_kws={"alpha": 0.05},
    line_kws={"color": "black"},
    ax=ax2,
)
ax2.set_xlabel("Rank of number of reviews")
ax2.set_ylabel("Rank of average rating")
ax2.set_title("Ratings < 5")
my.ax_axis_comma_format(ax=ax2)

# Common
plt.suptitle("Relationship between Ranks of Mean Rating and Reviews Amount")
plt.tight_layout();
   variable_1  variable_2   subset           r     p
0  n_reviews   mean_rating  All ratings  -0.40  0.00
1  n_reviews   mean_rating  Rating < 5    0.60  0.00
Results: In general, the number of reviews shows larger correlation coefficient estimates with the investigated variables than ratings do.
The largest correlation was detected between the number of reviews per podcast and the mean podcast rating. However, graphical investigation revealed that when the top ratings (mean rating = 5) are removed, the sign of the correlation coefficient (and the slope of the regression line) changes from negative (Spearman's r = -0.40, p_adj < 0.001) to positive (r = 0.60, p < 0.001; p_adj was not calculated). This means that a high proportion of tied top ratings distorted the results: the trends contradict each other depending on whether these values are included or excluded.
A graphical investigation of a few more variables with the strongest correlations did not reveal this kind of discrepancy (the plots are not included in the report).
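The distorting effect of a mass of tied top values on Spearman's correlation can be demonstrated on synthetic data (an illustrative toy example, not the project's data):
Code
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Podcasts with imperfect ratings: more reviews -> higher mean rating
n = rng.integers(10, 1000, size=500)
imperfect = pd.DataFrame({
    "n_reviews": n,
    "mean_rating": (4 + 0.001 * n + rng.normal(0, 0.1, size=500)).clip(1, 4.99),
})

# Many low-review podcasts with a perfect, tied rating of 5.0
perfect = pd.DataFrame({
    "n_reviews": rng.integers(1, 20, size=500),
    "mean_rating": 5.0,
})

both = pd.concat([imperfect, perfect], ignore_index=True)
print(stats.spearmanr(both["n_reviews"], both["mean_rating"])[0])            # negative
print(stats.spearmanr(imperfect["n_reviews"], imperfect["mean_rating"])[0])  # positive

Including the block of tied 5.0 ratings flips the sign of the correlation, mirroring the pattern observed above.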
The next largest correlations were detected between:
the number of reviews and various review and review title length metrics (e.g., mean_review_body_title_n_char_ratio r = 0.33, mean_review_body_n_words r = 0.32, mean_review_title_n_words r = 0.20), which essentially show that longer reviews and review titles, as well as a larger review-to-review-title length ratio, are related to a larger number of reviews;
the number of reviews and the number of categories per podcast (r = 0.17).
5 Looker Studio Dashboard
A part of this project was to create a dashboard with at least 3 charts in Looker Studio. The resulting tool allows interactive exploration of temporal trends and patterns in ratings and the number of reviews for various subsets of the data, and it is accessible here. The dashboard consists of 2 pages:
The first page contains plots whose rating y-axis ranges from 1 to 5. This version of the dashboard is more flexible for interactive analysis, as a wider range of possible mean rating values can be visualized.
The second page contains the same plots but with a zoomed-in rating axis. This allows seeing (possible) patterns in ratings where the differences between values are small, but this version of the dashboard is less convenient for interactive exploration.
A preview of the dashboard pages can be seen in the screenshots below, where:
blue columns represent the number of reviews and
red connected dots indicate the average rating.
Due to Looker Studio's 100 MB per-dataset limit, only a subset of the data (~92% of reviews from the data_all dataset) is used in the dashboard.
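A minimal sketch of how such a size-capped subset could be produced for upload (the sampling approach and file name shown here are assumptions; only the ~92% figure and the data_all name come from the report):
Code
# Randomly sample reviews so that the exported CSV fits under the 100 MB limit
subset = data_all.sample(frac=0.92, random_state=1)
subset.to_csv("reviews_for_looker_studio.csv", index=False)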
6 Main Findings
Sample size:
More than 100 thousand podcasts and more than 2 million reviews are included in the analysis.
Ratings (overall):
The distribution of ratings is asymmetric: the top rating (5 points) is the most common (in 87% of reviews);
The median rating is 5 stars too;
The mean overall rating is 4.63.
The analysis of categories:
There are 19 main categories of podcasts (after merging similar sub-categories) and 110 sub-categories;
Each podcast has up to 5 main categories;
Single-category podcasts are the most common (73% of all podcasts) but, compared to podcasts with 2 or 3 categories, are rated slightly lower (4.80 vs. 4.83 and 4.84).
The average category rating and the number of podcasts per category are positively correlated (Spearman's r = 0.77, p < 0.001): categories with more podcasts tend to have better ratings (and vice versa).
The most common categories among podcast creators are:
“society/culture” (13% of podcasts),
“religion/spirituality” (11%),
“business” (11%), and
“education” (10%).
The least common categories are:
“history”, “crime” and “government” (< 1% each).
The best-rated categories are “business” (rating is 4.89) and “religion/spirituality” (4.88).
The next best-rated categories differ from the best ones only by the second decimal digit (the difference is negligible although it is statistically significant).
The worst-rated category is “crime” (4.49).
Temporal patterns and trends:
Data covers the period from December 2005 to December 2022 (17 years).
Differences between the minimum and maximum ratings and between the minimum and maximum amounts of reviews, as well as the correlation between these two metrics (ns – not significant):
Comparing                    Difference in Ratings  Difference in Amount of Reviews  Spearman's Correlation
Years (2006-2021)                             0.28                           23.3 %  ns
Months of year                                0.05                            1.5 %  ns
Days of week                                  0.04                            6.8 %  r = 0.79, p = 0.036
Hours of day (in UTC-07:00)                   0.22                            5.4 %  r = 0.69, p < 0.001
Year-to-year trends:
Clear trends in both the amount of reviews and ratings are visible:
the yearly number of reviews was increasing until 2020 and has been dropping since then;
the average yearly rating was increasing until 2014-2015 and has been dropping since then.
Month-to-month patterns:
Overall, the largest share of monthly reviews (9.2%) is created in October and the smallest (7.7%) in December.
Patterns in ratings are statistically significant yet negligible in size.
Weekly patterns (by the day of the week):
The most reviews per day are created in the middle of the week (e.g., 17.0% of weekly reviews on Wednesday) and the fewest on weekends (10.2% on each weekend day).
Although the range of ratings is small, a statistically significant relationship between the number of reviews and the average rating grouped by weekday is detected (see the table above).
Daily patterns (by hour of day):
Patterns in the number of reviews and in ratings are very clear: during the daytime of the indicated time zone (UTC-07:00), more reviews per hour are created and higher average ratings are given, while at night (especially around 1-2 AM) fewer reviews are created and ratings are worse.
The relationship between the 2 metrics is statistically significant (see table above).
Correlation analysis:
Between reviews' ratings and other characteristics (including review, review title, and podcast title length metrics and the numbers of categories and sub-categories), statistically significant but extremely weak correlations (Spearman's |r| < 0.10) are detected.
The correlation between the number of reviews per podcast and the mean podcast rating gave contradictory results, so it is not commented on further.
The correlation between the number of reviews per podcast and review length metrics was significant and positive (up to r = 0.33) showing that podcasts with more reviews tend to have longer reviews.
The correlation between the number of reviews and the number of categories per podcast is significant and positive (r = 0.17).
7 Limitations and Suggestions for Improvement
Some ideas on how the analysis may be improved and what else can be analyzed:
Confidence intervals and statistical hypothesis testing results could be displayed on a plot.
In the analysis of temporal patterns and trends, patterns at a finer granularity may vary across periods of a coarser granularity (e.g., weekly patterns may differ from year to year). This was not investigated.
A more detailed analysis of separate podcasts and their characteristics was not performed.
A more detailed analysis of separate users/reviewers was not performed.
A more detailed analysis of the review text, review title, and podcast title may be performed (see the sketch at the end of this section), e.g.:
Language identification.
Analysis of emojis.
Sentiment analysis.
The time duration between the newest and oldest review of each podcast may be analyzed.
Dashboard:
A more detailed description of the dashboard and used variables could be provided.
Filters for categories are not included in the dashboard.
Only a fraction of the data analyzed in Python was included in the dashboard (due to technical reasons).
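For instance, the language-identification and sentiment-analysis steps suggested above could start from something like this (a sketch; the langdetect package and NLTK's VADER are assumed to be available, and the English-centric VADER scorer is applied only to reviews detected as English):
Code
import nltk
from langdetect import detect, DetectorFactory
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # VADER lexicon is needed once
DetectorFactory.seed = 0  # make langdetect deterministic

def review_language_and_sentiment(text: str) -> dict:
    """Detect the review language; score sentiment only for English text."""
    lang = detect(text)
    compound = None
    if lang == "en":
        compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
    return {"language": lang, "sentiment": compound}

print(review_language_and_sentiment("Great podcast, I listen every week!"))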