Free AI web copilot to create summaries, insights and extended knowledge, download it at here

23428

Abstract

s="hljs-comment"># count the number of previous nominations and wins for the current name previous_nom = sum([1 for x in previous_nominations.get(name, []) if x < year]) previous_win = sum([1 for x in previous_wins.get(name, []) if x < year])

<span class="hljs-keyword">else</span>:
    <span class="hljs-comment"># if the name hasn't appeared in previous ceremonies, set the counts to 0</span>
    previous_nom = 0
    previous_win = 0

<span class="hljs-comment"># update the previous nominations and wins dictionaries for the current name and ceremony year</span>
previous_nominations.setdefault(name, []).append(year)
<span class="hljs-keyword">if</span> row[<span class="hljs-string">"winner"</span>] == True:
    previous_wins.setdefault(name, []).append(year)

<span class="hljs-comment"># assign the previous nomination and win counts to new columns in the dataframe</span>
df.at[index, <span class="hljs-string">"previous_nominations"</span>] = previous_nom
df.at[index, <span class="hljs-string">"previous_wins"</span>] = previous_win

# convert the columns to integers df["previous_nominations"] = df["previous_nominations"].astype(int) df["previous_wins"] = df["previous_wins"].astype(int)</pre></div>The code above goes through each nominee’s name and counts the number of times they have been nominated in the past and the number of times they have won in the past.Total number of nominations for each film may also be relevant. Films with many nominations across categories historically tend to have a higher likelihood of winning Best Picture. This was not a feature that ChatGPT originally suggested that we include, but an observation of my own. I asked ChatGPT if this might be useful to include and it agreed.To include this feature in our data, we use this code:<div id="f98a"><pre>df['other_noms'] = df.groupby(['film','year_film'])['category'].transform('nunique')</pre></div><h2 id="e6eb">Using OMDB to get critical and audience reception data</h2>Now we need critical and audience reception data. While TMDB has a popularity field, we want audience reception from IMDb and critical reception from Rotten Tomatoes and Metacritic. The TMDB does give us the IMDb ID, which we can use to get IMDb, Rotten Tomatoes, and Metacritic data using the <a href="https://www.omdbapi.com/">OMDB API.</a><div id="e3b7"><pre># Set the OMDB API key api_key = config.omdb_api_key

# Create an empty list to store the scores for each movie scores = []

# Loop through each row in the dataframe and search for the movie using the OMDB API for index, row in df.iterrows(): # Extract the IMDB ID from the row imdb_id = row["imdb_id"]

<span class="hljs-comment"># Make a request to the API and parse the response as JSON</span>
response = requests.get(<span class="hljs-string">f"http://www.omdbapi.com/?i=<span class="hljs-subst">{imdb_id}</span>&amp;apikey=<span class="hljs-subst">{api_key}</span>"</span>)
data = response.json()

<span class="hljs-comment"># Extract the ratings from the response</span>
ratings = {<span class="hljs-string">"imdb_rating"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"rotten_tomatoes_rating"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"metacritic_rating"</span>: <span class="hljs-literal">None</span>}
<span class="hljs-keyword">for</span> rating <span class="hljs-keyword">in</span> data.get(<span class="hljs-string">"Ratings"</span>, []):
    <span class="hljs-keyword">if</span> rating[<span class="hljs-string">"Source"</span>] == <span class="hljs-string">"Rotten Tomatoes"</span>:
        ratings[<span class="hljs-string">"rotten_tomatoes_rating"</span>] = rating[<span class="hljs-string">"Value"</span>]
    <span class="hljs-keyword">elif</span> rating[<span class="hljs-string">"Source"</span>] == <span class="hljs-string">"Metacritic"</span>:
        ratings[<span class="hljs-string">"metacritic_rating"</span>] = rating[<span class="hljs-string">"Value"</span>]
    <span class="hljs-keyword">elif</span> rating[<span class="hljs-string">"Source"</span>] == <span class="hljs-string">"Internet Movie Database"</span>:
        ratings[<span class="hljs-string">"imdb_rating"</span>] = rating[<span class="hljs-string">"Value"</span>]

<span class="hljs-comment"># Append the ratings to the scores list</span>
scores.append({
    <span class="hljs-string">"imdb_id"</span>: imdb_id,
    **ratings
})

# Create a pandas dataframe from the list of scores scores_df = pd.DataFrame(scores)

# Merge the scores dataframe with the original dataframe on imdb_id df = pd.merge(df, scores_df, on="imdb_id", how="left")</pre></div>You’ll need to get an API key from OMDB to use the API, and depending on how many queries you do, you may need to pay a small monthly fee.<h2 id="27ed">Breaking categorical variables into independent features</h2>We want to include the “genre” field in our analysis, but right now, each film has multiple genres associated with it — drama, romance, and action, for example. These genres are listed as dicts in the genre column for each film (see screenshot below).<figure id="0e2c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*M77U0P_z1s-MziKD79PcAA.png"><figcaption></figcaption></figure>We need to create new columns for each genre. Here is the code that ChatGPT generated to do this:<div id="eca5"><pre>from ast import literal_eval

convert the 'genres' column from a string to a list of dictionaries

df['genres'] = df['genres'].apply(literal_eval)

create a set of all unique genre names

genre_set = set() for genres in df['genres']: for genre in genres: genre_set.add(genre['name'])

create a new column for each genre

for genre in genre_set: df[genre] = 0

set the value to 1 for each genre that applies to the movie

for index, row in df.iterrows(): genres = row['genres'] for genre in genres: df.at[index, genre['name']] = 1

drop the original 'genres' column

df.drop(columns=['genres'], inplace=True)</pre></div>Likewise, there is a column called “original_language”. We want to create new columns for each potential language.<div id="567b"><pre># create a set of all unique original language names language_set = set(df['original_language'])

for language in language_set: df[language] = 0

set the value to 1 for each original language that applies to the movie

for index, row in df.iterrows(): language = row['original_language'] df.at[index, language] = 1

drop the original 'original_language' column

df.drop(columns=['original_language'], inplace=True)</pre></div>After creating these genre columns, I wanted to explore some patterns historically. Not surprisingly, the genre with the most nominations and the most wins is Drama.<figure id="9232"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ZfuWyKWwSmzooy0ZoIsbIQ.png"><figcaption></figcaption></figure>I also wanted to see the portion of nominees that actually end up winning. While Drama is the most nominated and winning category, Family and Music have the highest winning percentage given nominations at 25 percent each.<figure id="dc6d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*H0x9TWdGgZEc7wccRGWIkQ.png"><figcaption></figcaption></figure>What does this mean for this year’s nominees? For one, no Science Fiction film has ever won Best Picture, and both Everything Everywhere All at Once and Avatar: The Way of Water are categorized as Science Fiction. The Music genre has a good track record of winning Best Picture when they are nominated (6 wins out of 24 nominations). Both Tár and Elvis are categorized as Music.Rather than keeping all of the genres as their own features for the model, under ChatGPT’s guidance, we separated them into the major genres and “other”, defined as genres that have a frequency of less than 5 percent.<div id="5b22"><pre>genre_columns = ['Documentary', 'Horror', 'War', 'Thriller', 'Comedy', 'Family', 'TV Movie', 'Music', 'Western', 'Adventure', 'Drama', 'Romance', 'Action', 'Science Fiction', 'Fantasy', 'History', 'Mystery', 'Animation', 'Crime']

# Calculate the frequency of each genre genre_freq = df[genre_columns].sum() / len(df)

# Create a list of genres to be included in the "Other" category other_genres = list(genre_freq[genre_freq < 0.05].index)

# Add up the columns of genres that are frequently occurring frequent_genres = [col for col in genre_columns if col not in other_genres] df['Frequent_Genres'] = df[frequent_genres].sum(axis=1)

# Set the value of the "Other" category to 1 if any of the infrequent genres are present df['Other'] = df[other_genres].max(axis=1)

# Drop the original genre columns df.drop(columns=genre_columns, inplace=True)</pre></div><h1 id="4cd3">Building and training the model</h1>Now we can finally build and train the model. When I asked ChatGPT which model we should use, it responded:<blockquote id="3e9e">There are several models that can be used to predict a binary outcome, including:</blockquote><blockquote id="5e5c">Logistic Regression: This is a widely used model for binary classification problems. It models the probability of the binary outcome based on the input features and uses a logistic function to convert the output to a probability.</blockquote><blockquote id="b9dc">Decision Trees: Decision trees can be used for binary classification problems as well. They work by recursively splitting the dataset based on the input features until a certain stopping criterion is met.</blockquote><blockquote id="a555">Random Forest: Random forests are an extension of decision trees, where multiple trees are built and combined to improve the accuracy and prevent overfitting.</blockquote><blockquote id="384d">Support Vector Machines (SVM): SVM is a powerful model that can be used for binary classification problems. It finds the best hyperplane that separates the two classes with the maximum margin.</blockquote><blockquote id="5b74">Neural Networks: Neural networks can be used for binary classification problems as well. A neural network can have a single output node with a sigmoid activation function, which outputs a probability estimate.</blockquote><blockquote id="1eb9">The choice of the model depends on various factors such as the size of the dataset, the complexity of the problem, the interpretability of the model, and the available computational resources. It is always a good practice to try different models and compare their performance on the validation set before selecting the best one.</blockquote>I decided to just use a logistic regression model. We split the data into a training data set and a test data set, using an 80–20 split. Then we separated films made in 2022 because that is what we are going to try to predict. We also used recursive feature elimination with cross-validation to limit the total number of features in our model. Here is the code that ChatGPT generated, with significant guidance and iteration from yours truly:<div id="b7c3"><pre>import pandas as pd from sklearn.linear_model import LogisticRegression from sklearn.feature_selection import RFECV from sklearn.metrics import accuracy_score from sklearn.metrics import precision_score, recall_score, f1_score from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split

# Define the feature columns feature_cols = ['budget', 'revenue', 'runtime', 'previous_nominations','previous_wins', 'imdb_rating', 'rotten_tomatoes_rating', 'metacritic_rating', 'Frequent_Genres', 'genre', 'english_language']

# Define the target column target_col = 'winner'

# Define a function to convert the rating strings to float values def convert_rating(x): if pd.isnull(x): return x elif isinstance(x, str): parts = x.split("/") if len(parts) == 1: return float(x.strip('%')) / 10 elif len(parts) == 2: return float(parts[0]) / float(parts[1]) else: return x

# Apply the function to the rating columns df['imdb_rating'] = df['imdb_rating'].apply(convert_rating) df['rotten_tomatoes_rating'] = df['rotten_tomatoes_rating'].apply(convert_rating) df['metacritic_rating'] = df['metacritic_rating'].apply(convert_rating)

# Filter out the 2022 data for prediction predict_df = df[df['year_film'] == 2022].dropna(subset=feature_cols)

# Split the data into training and test sets for all previous years df = df.dropna(subset=feature_cols + [target_col]) train_df, test_df = train_test_split(df[df['year_film'] < 2022]. dropna(subset=feature_cols + [target_col]), test_size=0.2, random_state=42, stratify=df[df['year_film'] < 2022][target_col])

train_df["winner"] = train_df["winner"].apply(lambda x: 1 if x == True else 0) test_df["winner"] = test_df["winner"].apply(lambda x: 1 if x == False else 0)

# Scale the features scaler = StandardScaler() train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols]) test_df[feature_cols] = scaler.transform(test_df[feature_cols]) predict_df[feature_cols] = scaler.transform(predict_df[feature_cols])

# Recursive feature elimination with cross-validation estimator = LogisticRegression(multi_class='multinomial', solver='lbfgs') selector = RFECV(estimator, step=1, cv=100) selector = selector.fit(train_df[feature_cols], train_df[target_col])

# Select the top features selected_features = [f for f, s in zip(feature_cols, selector.support_) if s] print(selected_features)

<span class="hljs-

Options

comment"># Train the model model = LogisticRegression(multi_class='multinomial', solver='lbfgs') model.fit(train_df[selected_features], train_df[target_col])

# Predict the test set and calculate the accuracy test_df['predicted_winner'] = model.predict(test_df[selected_features]) test_df['predicted_prob'] = model.predict_proba(test_df[selected_features])[:,1] predict_df['predicted_prob'] = model.predict_proba(predict_df[selected_features])[:,1]

# Convert the predicted probabilities to binary predictions using a threshold value of 0.5 test_df['predicted_winner_binary'] = test_df['predicted_prob'].apply(lambda x: 1 if x >= 0.05 else 0)

# Calculate the accuracy, precision, recall, and F1 score based on the binary predictions accuracy = accuracy_score(test_df['winner'], test_df['predicted_winner_binary']) precision = precision_score(test_df['winner'], test_df['predicted_winner_binary']) recall = recall_score(test_df['winner'], test_df['predicted_winner_binary']) f1 = f1_score(test_df['winner'], test_df['predicted_winner_binary'])

# Print the results print(f"Accuracy: {accuracy:.2f}") print(f"Precision: {precision:.2f}") print(f"Recall: {recall:.2f}") print(f"F1 score: {f1:.2f}")</pre></div>The code above, not surprisingly, took the most time to write. ChatGPT was able to immediately generate a piece of code that created a logistic regression model for the data, but it wasn’t perfect. The “ratings” columns, for example, hadn’t yet been converted to float objects, so they couldn’t be used in the model. There was no recursive feature elimination in the first version either, it just built a model that used every variable. There was no accuracy, precision, recall or F1 scores in the first version; I had to ask ChatGPT how we could validate our model.The features selected for the model are: budget, revenue, runtime, previous_nominations, previous_wins, imdb_rating, metacritic_rating, genre, other_noms, and english_language. Overall, the model performs relatively well with an accuracy of 0.45, precision of 0.72, recall of 0.54, and an F1 score of 0.62.I asked ChatGPT to interpret these results. It described what each of these measures represents and then said:<blockquote id="b812">In summary, the model has a low accuracy, but a high precision, indicating that when it predicts a movie will win Best Picture, it is often correct. However, the model’s recall is relatively low, meaning that it does not identify all of the actual Best Picture winners in the test set. The F1 score is also relatively low, suggesting that the model’s performance is moderate overall.</blockquote>To see the coefficients associated with each feature we use this code:<div id="ba13"><pre># Print the coefficients and their corresponding feature names for feature, coef in zip(selected_features, model.coef_[0]): print(f"{feature}: {coef:.2f}")</pre></div>The coefficients generally make intuitive sense to me. Revenue, runtime, previous_wins, imdb_rating, metacritic_rating, other_noms, and english_language are all positively correlated with winning, while budget, previous_nominations, and genre are all negatively correlated.<figure id="959a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*urbSXNgAfAZFrFL7-x7ngA.png"><figcaption></figcaption></figure>Why is previous_nominations negatively correlated with winning? I asked ChatGPT to help me understand these coefficients:<blockquote id="02bc">budget: Negative (-0.58). This means that as the budget of a movie increases, the likelihood of it winning Best Picture decreases. This may be because bigger budget movies are often seen as more commercial and less “artistic”, which is what the Academy tends to reward.</blockquote><blockquote id="8ab5">genre: Negative (-0.23). This means that certain genres are less likely to win Best Picture than others. It’s possible that the Academy has certain biases or preferences when it comes to genres.</blockquote><blockquote id="41c8">revenue: Positive (0.26). This means that as the revenue of a movie increases, the likelihood of it winning Best Picture increases. This could be because movies that are financially successful are often seen as more popular and more culturally significant.</blockquote><blockquote id="802c">runtime: Positive (0.10). This means that as the runtime of a movie increases, the likelihood of it winning Best Picture increases. This could be because longer movies are often seen as more ambitious and more artistically significant.</blockquote><blockquote id="6113">previous_nominations: Negative (-0.28). This means that as the number of previous nominations a movie has received increases, the likelihood of it winning Best Picture decreases. This may be because the Academy tends to spread the awards around and reward new films or filmmakers rather than ones that have already been recognized in the past.</blockquote><blockquote id="8ec7">previous_wins: Positive (0.10). This means that as the number of previous wins a movie has received increases, the likelihood of it winning Best Picture increases. This could be because the Academy tends to reward films or filmmakers that have a history of success.</blockquote><blockquote id="183e">imdb_rating: Positive (0.25). This means that as the IMDB rating of a movie increases, the likelihood of it winning Best Picture increases. This could be because high IMDB ratings indicate popular and critically acclaimed movies.</blockquote><blockquote id="0746">metacritic_rating: Positive (0.11). This means that as the Metacritic rating of a movie increases, the likelihood of it winning Best Picture increases. This could be because high Metacritic ratings indicate critically acclaimed movies.</blockquote><blockquote id="ff10">english_language: Positive (0.12). This means that movies in English are more likely to win Best Picture than movies in other languages. This may be because the majority of Academy voters are English-speaking and may have biases towards movies in their own language.</blockquote><blockquote id="7eae">other_noms: Positive (0.56). This means that as the number of nominations a movie receives in other categories increases, the likelihood of it winning Best Picture increases. This makes sense given what we discussed earlier about the correlation between total nominations and Best Picture wins.</blockquote>To me, the response above is very helpful. It may or may not be “correct”, but as a data scientist trying to make sense of coefficients, having potential explanations is useful to think through. These explanations may lead to more questions, more data exploration, and model refinement, which will make for more thorough overall analysis.<h2 id="521e">Make predictions and analyze results</h2>Now we want to make predictions for this year’s Oscars and view the contribution of each of these features towards the probability of winning.<div id="e1f8"><pre># Create a table with the features, coefficients, and contribution to the probability of winning selected_columns = ['name', 'film', 'category'] + selected_features predict_df = predict_df[selected_columns].copy()

coef = pd.Series(model.coef_[0], index=selected_features) predict_df["predicted_winner"] = model.predict(predict_df[selected_features]) predict_df["predicted_prob"] = model.predict_proba(predict_df[selected_features])[:, 1]

contributions = [] for idx, row in predict_df.iterrows(): name = row['name'] film = row['film'] category = row['category'] for feature, coef_val in coef.iteritems(): feature_value = row[feature] contribution = feature_value * coef_val contributions.append([name, film, category, feature, coef_val, feature_value, contribution, row['predicted_prob']])

contributions_df = pd.DataFrame(contributions, columns=['name', 'film', 'category', 'feature', 'coefficient', 'feature_value', 'contribution', 'predicted_prob'])

contributions_df = pd.DataFrame(contributions_df. groupby(['film','feature'])['predicted_prob','contribution'].mean()).sort_values(by='predicted_prob',ascending=False)

contributions_df = contributions_df.reset_index()</pre></div>Here are final results:<figure id="f677"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*i5Ch0yOLyORnEU1KYpRfdw.png"><figcaption></figcaption></figure>These numbers indicate the probability of winning for each of the films. Everything Everywhere All at Once (EEAAO), has the highest chance of winning at 26.9 percent, followed by The Banshees of Inisherin at 25.3 percent. It should be noted that the accuracy, recall, precision and F1 scores listed above were created using a threshold of 5 percent, meaning that there are actually four films with a non-negligible chance of winning: EEAAO, The Banshees of Inisherin, Tár, and All Quiet on the Western Front.Now we have a DataFrame of all the films, their probability of winning Best Picture, and the relative contribution of each of the features. To view this data in a more intuitive way, we made a heat map:<div id="fe1c"><pre>import seaborn as sns import matplotlib.pyplot as plt from matplotlib.colors import SymLogNorm

# pivot the dataframe to get the contributions by feature for each film contributions_by_film = contributions_df.pivot(index='film', columns='feature', values='contribution')

# sort the films by predicted_prob in descending order contributions_by_film = contributions_by_film.loc[contributions_df.sort_values('predicted_prob', ascending=False)['film'].unique()]

# create a heatmap fig, ax = plt.subplots(figsize=(10, 10)) cmap = sns.diverging_palette(10, 250, s=75, l=40, n=9, center='light', as_cmap=True) norm = SymLogNorm(linthresh=0.03, linscale=0.03, vmin=-1, vmax=1) sns.heatmap(contributions_by_film, cmap=cmap, annot=True, fmt='.2f', linewidths=.5, norm=norm) ax.set_title('Contributions by Feature for Each Film') plt.show()</pre></div>The films are ranked with the highest probability of winning at the top. The columns correspond to the different features we have used in the model. The value and color of each cell indicates the contribution towards the probability of winning of each of the factors to each of the films.<figure id="4bdf"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Z6V97PMD7lsJAdbmSPLoeQ.png"><figcaption></figcaption></figure>Before we try to interpret these findings ourselves, let’s hear ChatGPT’s interpretation:<blockquote id="abb9">Firstly, the number of previous nominations and wins a film has received is positively correlated with its probability of winning. This could explain why Everything Everywhere All at Once and The Banshees of Inisherin are ranked the highest, as both films have received numerous nominations in our data.</blockquote><blockquote id="3484">Secondly, the number of other nominations a film has received is also positively correlated with its probability of winning. This could explain why All Quiet on the Western Front is ranked higher than The Fabelmans and Elvis, as it has received more nominations in our data.</blockquote><blockquote id="cca4">Thirdly, the revenue a film generates is positively correlated with its probability of winning. This could explain why Tár is ranked higher than The Fabelmans, Elvis, Top Gun: Maverick, Triangle of Sadness, and Women Talking, as it has generated more revenue in our data.</blockquote><blockquote id="e5aa">Lastly, some factors may be more difficult to interpret, such as the negative correlation between budget and winning. It’s possible that higher budget films are perceived as being more commercial or less artistic, leading to a lower probability of winning in some cases.</blockquote>Again, I find this interpretation insightful both because of what it seems to get right and because of its inaccuracies. Firstly, the number of previous nominations is at the person level — a film cannot have been nominated in previous years. Also, the number of previous nominations is actually negatively correlated with winning, so that explanation doesn’t make sense. The second point is true — the number of nominations a film has is positively correlated with winning Best Picture and All Quiet on the Western Front has more nominations than almost any other movie (other than EEAAO). Third, while it is true that revenue is positively correlated with the probability of winning, that is not why Tár is ranked higher than Top Gun: Maverick.Top Gun is one of the highest grossing films of the year. It made nearly 1.5 billion USD — 8.75 times its budget and as Steven Spielberg <a href="https://people.com/movies/steven-spielberg-says-tom-cruise-saved-hollywoods-ass-with-top-gun-maverick/">told Tom Cruise</a>, “You saved Hollywood’s ass. And, you might have saved theatrical distribution. Seriously. Maverick might have saved the entire theatrical industry.” The only movie on this list that had a higher revenue was Avatar: The Way of Water, with a total revenue of nearly 2.3 billion USD. But the budget on Avatar was 460 million USD, so the return was only about five times the budget.In another response, ChatGPT suggested that Tár is ranked so high because it is a documentary and documentaries are more likely to win Best Picture. But Tár is not a documentary. It is about the fictional orchestra conductor Lydia Tár. And even if it were a documentary, no documentary has ever won Best Picture at the Oscars, so that interpretation doesn’t make sense.The real reason EEAAO is ranked number one is because it has received 11 total nominations at this year’s Academy Awards. EEAAO is also incredibly popular according based on its IMDb rating, second only to Top Gun: Maverick. It has a relatively high Metacritic rating (fourth highest of the nominees) and the producers have never received any Oscar nominations in the past — remember that as the number of previous nominations increases, the likelihood of winning decreases.The big factor working against EEAAO is that it is categorized as science fiction, and historically, no science fiction film has ever won Best Picture. In fact, only eight films categorized by one of the less frequent genres have ever won Best Picture. If you are curious, here is a list of those eight films. It is worth noting that The Shape of Water, often considered a science fiction film, is actually classified as fantasy according to TMDB.<figure id="5e76"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*hnuBRCXI9uHiUmcQD3AL8Q.png"><figcaption></figcaption></figure>The Banshees of Inisherin is ranked number two because it has nine total nominations, is relatively low budget (20 million USD), it has a pretty high IMDb rating, and a very high Metacritic rating (second only to Tár). The relatively short runtime (114 minutes) and low revenue (41 million USD) lowered its overall probability some.All Quiet on the Western Front is third because it also received nine nominations, is also lower budget (also 20 million USD), was made by previously unrecognized producers, and also has a pretty high IMDb rating. But it is in German and only two non-english films have ever won Best Picture: Parasite and The Artist.Rounding out the fourth spot is Tár. Tár has the highest Metacritic score of all the nominees but only has six total nominations. However, CODA won Best Picture last year with only 3 nominations. If we exclude the ‘other nominations’ data Tár is the most likely to win.All four films above have a non-negligible chance of winning. The rest are not critically acclaimed enough (Elvis, Triangle of Sadness, Avatar: The Way of Water), are not popular enough (Elvis, Triangle, Women Talking), are too high budget (Elvis, Top Gun, Avatar, The Fabelmans, Women Talking), do not have high enough revenue (Fabelmans, Triangle, Women Talking), too few nominations (Top Gun, Triangle, Women Talking, Avatar), or the producers have already been significantly recognized (Fabelmans).Out of curiosity, I ran the model against last year’s nominees, and here are the results:<figure id="b4c0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*KlPPX_5z-JF-PLjohEWRQw.png"><figcaption></figcaption></figure>It would have predicted that The Power of the Dog would win, and that CODA, the film that did end up winning, would have almost no chance of winning. That is largely because CODA had only three total nominations, compared to the twelve nominations for The Power of the Dog.Since the total number of nominations has such a big factor on the results, I ran the model again without this feature. In that version of the model, CODA was predicted to win last year, and Tár is predicted to win this year.<h1 id="7ad3">Conclusion</h1>ChatGPT is an incredible tool for a data scientist but there are definitely some shortcomings. It can’t take over the work entirely — there still needs to be a human guiding the analysis, for now.For the final predictions, it is important to remember the factors that we are not including in this model: industry buzz (early reviews, other awards, etc.), political and social issues, and marketing and distribution. It is impossible to know what effect those factors would have on this model without including the data.In terms of other awards, The Fabelmans won the Golden Globe for Drama Motion Picture, The Banshees of Inisherin won the Golden Globe for Musical or Comedy Motion Picture, All Quiet on the Western Front won the BAFTA for Best Film, and EEAAO won Outstanding Cast.Personally, I, as a human, think it would not be surprising if any of the top four films (EEAAO, Banshees, All Quiet, or Tár) won.</article></body>

Image generated using Midjourney: https://www.midjourney.com/

Using ChatGPT to predict the Oscars

How to use ChatGPT to build a model to predict who will win Best Picture at the 2023 Academy Awards

I used ChatGPT to build a model to predict the winner of the Best Picture category at the 2023 Academy Awards. Over the past couple of weeks, I have worked on this project with ChatGPT. I will share the code we generated, the decision-making process behind it, and the final results.

The purpose of this article is threefold:

Assess the strengths and weaknesses of working with ChatGPT on a data science project.
Provide a tutorial for others to follow who are interested in exploring and modeling Oscar nominees.
Generate some insights and takeaways from the data and model itself.

Notes: If you are using this for purpose 2, see the notebook on my GitHub for the complete code. Also, some of these chunks of code can take a long time to run.

First, here are my thoughts on the benefits and areas for improvement for ChatGPT, based on my experience on this project.

Strengths:

ChatGPT is an amazing tool that speeds up analysis and problem-solving. I would not have been able to do this work as quickly without ChatGPT.
As strange as it sounds, doing this kind of work with ChatGPT was more enjoyable than doing it alone. Having a tool to talk to in a conversational way prevented me from, for the most part, falling down frustrating rabbit holes where I beat my head against the wall for hours, unable to figure something out.
ChatGPT is great at brainstorming and thinking through problems conceptually and breaking them down into workable steps. For example, when I asked it to create a plan to predict which film will win Best Picture at the Academy Awards, it gave me a high-level overview of the steps required (more on this below).
It was great at providing lists of potential factors to consider when choosing data and models.
It was also helpful in thinking of potential reasons different factors were included or excluded in the model, or had positive or negative coefficients.
It is excellent at quickly generating code for doing almost any simple data science task like reshaping data, building and testing simple models, and generating simple data visualizations of the result. This code isn’t flawless though — more on this below.
It can also refine code — given a piece of code and the error it can help to solve the issue.
I found one of its greatest strengths to be text editing. While I didn’t use that capability much in the analysis, it was helpful in writing this post :).

Opportunities for Improvement:

I mentioned that ChatGPT is an amazing tool, but it is just that: another tool. There are limitations listed on the ChatGPT homepage, “May occasionally generate incorrect information”, “May occasionally produce harmful or biased content,” and “Limited knowledge of world and events after 2021.” Based on these limitations, it only makes sense to treat ChatGPT as a tool that can often be incredibly helpful, but not as a ‘source of truth.’
ChatGPT is built to answer the question you ask it, but it cannot ask the necessary clarification questions required to answer the question like a human would. For example, ChatGPT suggested building a regression model to predict who will win at the Oscars. However, when given a sample of data and asked it to write code to create the regression model, it produced a linear regression model. Since the dependent variable is binary (either a film/person won the award or not), a logistic regression model is most appropriate. Another example of this was in data sourcing. When asked about how to get critical reception data, ChatGPT suggested using web scraping packages to get all the data from the Rotten Tomatoes website, even though there are other datasets available on the internet that are much easier to use. While technically ChatGPT answered the questions with a solution, it is important to note that its suggestions may not always be the best solution.
ChatGPT (at least the version that is free and publicly available now) was trained on data ending in 2021 and does not have access to the internet. This could be the reason it is not good at finding good data sets — it cannot search the internet for them. It may also be the reason it struggled with web scraping code — it cannot see the html of a website itself.
While it is often able to generate code instantly to carry out almost any data science task, it is not perfect. It may generate code that produces errors or doesn’t satisfy the task given. Providing explicit and descriptive instructions is essential to avoid ambiguity in the prompt. Furthermore, while ChatGPT may generate a piece of code that is 80–90 percent good to go, iterating with ChatGPT is necessary to complete the remaining part.
ChatGPT may refine code and send it back, but the refined code is not always perfect. Iterating on a piece of code with ChatGPT may take a lot of time, to the point where it may be faster to write it from scratch. This depends as much on an individual’s ability to work with ChatGPT (ability to generate descriptive prompts, asking the right follow up questions, etc.) as it does on an individual’s technical coding ability.
While ChatGPT can generally remember previous prompts, it may not remember everything. In situations where there are multiple issues, it may reintroduce the first issue that has already been solved while iterating on the second issue. My strategy in these situations is to “reset” and say, “Here is my current code and here is my current problem,” and that generally seems to work.
Finally, ChatGPT struggled to interpret the final results. I asked if the results made intuitive sense, and some of its response was coherent and logical but some was not, and included false statements.

I sent the above list of strengths and weaknesses to ChatGPT, to which it responded, “Overall, I think you have provided a fair and accurate assessment of my strengths and weaknesses, and I hope I can continue to be a helpful tool for your data science projects.” It also stated, “It’s important to remember that while ChatGPT can be incredibly helpful, it is not a substitute for human expertise and judgment.”

Overall Plan

Given a very basic initial prompt of, “I want to build a model to predict who will win the 2023 Academy Awards. How could we create a dataset and a model to predict winners?”, it returned:

Building a model to predict the winners of the 2023 Academy Awards would require collecting data on several key factors that could influence the outcome of the awards. Here are some steps you could take to create a dataset and a model for this task:

Identify the categories that you want to predict. The Academy Awards have several categories, such as Best Picture, Best Director, Best Actor, Best Actress, etc. Depending on your goals, you may want to focus on predicting one or more of these categories.

Gather data on the nominees. Once the nominees for the relevant categories have been announced, you can gather data on them. This could include information on the films they are nominated for (e.g., budget, box office performance, critical reception), their past awards history, and their public profile (e.g., social media following, interviews).

Identify additional factors that could impact the outcome. In addition to the nominees themselves, there are other factors that could impact the outcome of the awards, such as the historical trends of the Academy Awards, the demographics of the voting members, and any relevant social or political context.

Build a dataset. Using the data you’ve gathered, create a dataset that includes information on the nominees and any additional factors that you’ve identified. The dataset should also include information on the winners of the awards in the categories you’re predicting for previous years.

Select an appropriate model. Depending on the nature of the data, you may need to select an appropriate model for prediction. For example, you could use a linear regression model to predict box office performance, or a classification model to predict the winner of a particular category.

Train and validate the model. Once you’ve selected a model, you can train it using the dataset you’ve created. Use a validation set to test the model’s performance and adjust the parameters as needed.

Use the model to make predictions. Once you’re satisfied with the model’s performance, you can use it to predict the winners of the 2023 Academy Awards based on the data you’ve gathered.

Data Collection and Cleaning

Following ChatGPT’s advice, I started with gathering data on all of the nominees. When asked what kind of data would be best, it said:

Critical reception: The critical reception of a film, as reflected by its score on websites like Rotten Tomatoes or Metacritic, can be a strong predictor of Oscar success.

Audience reception: The audience reception of a film, as reflected by its box office performance or user ratings on websites like IMDb or Letterboxd, can also be a factor in predicting Oscar success.

Industry buzz: Industry buzz, including early reviews, film festival screenings, and awards from other organizations, can indicate a film’s popularity and chances of winning.

Filmmaker history: The history of the filmmakers involved in a project can also be a factor, as Oscar voters tend to favor established directors and actors who have previously been nominated or won awards.

Genre: Certain genres, such as historical dramas, biopics, and adaptations of literary works, have historically been more likely to win Best Picture.

Political and social issues: The Academy often rewards films that address important social and political issues or that have a strong message.

Marketing and distribution: The marketing and distribution strategies of a film’s studio can also play a role in its success, as a well-executed campaign can generate buzz and build momentum leading up to the awards season.

Getting and cleaning the data

I found this dataset on Kaggle that has all of the previous Academy Award nominees and winners from 1927 to 2023. Thank you Raphael Fontes and David Lu! This dataset has all of the movies nominated for all of the categories, the years they were made, and whether they won. Now we need to get data for each of these films. I have used the TMDB API in previous posts to get data for the cast and crew of films. That API also has info on budget, revenue, genre, runtime, and language. To query the TMDB API, we need the TMDB ID for each of the films on our Oscar noms list. Here is the code we wrote to query the TMDB API using the film name and year in the Oscar data:

# get TMDB API key from config.py file
api_key = config.tmdb_api_key

#This data came from this Kaggle contest: https://www.kaggle.com/datasets/unanimad/the-oscar-award?resource=download
df = pd.read_csv("the_oscar_award.csv")

# Get a list of unique movie names and years
unique_movies = df[['film', 'year_film']].drop_duplicates()

# Create an empty dictionary to store the movie IDs
movie_ids = {}

# Loop over the unique movie names and query the TMDB API for the ID
for i, row in unique_movies.iterrows():
    # Construct the API URL
    url = f"https://api.themoviedb.org/3/search/movie?api_key={api_key}&query={row['film']}&year={row['year_film']}"
    
    # Query the API
    response = requests.get(url)
    data = response.json()
    
    # Extract the movie ID if the search is successful
    if data['total_results'] > 0:
        movie_id = data['results'][0]['id']
        movie_name = f"{row['film']} ({row['year_film']})"
        movie_ids[movie_name] = movie_id

# Update the dataframe with the movie IDs
df['tmdb_id'] = df.apply(lambda x: movie_ids.get(f"{x['film']} ({x['year_film']})", ""), axis=1)

The code above is a great example of the power of ChatGPT and why a human is needed to guide it. The code was generated by ChatGPT, and did not require much iteration, so it definitely saved me time to have ChatGPT write it. However, the first version of the code did not include ‘year_film’ in the query and so we were getting duplicate results and conflicting TMDB IDs because different films can have the same name. There have been four versions of Little Women since 1927, for example, all of which have been nominated for multiple Oscars.

While this code gets the TMDBs for most of the films on the list, there are some missing values. I manually entered a bunch of extra IDs, especially for films nominated for Best Picture — see my notebook for those IDs.

Using the TMDB IDs, we can get more data on each film from the TMDB API:

# Convert 'tmdb_id' column to integer type
df['tmdb_id'] = df['tmdb_id'].astype(str).replace('\.0', '', regex=True)

# Get unique tmdb_id values
unique_tmdb_ids = df['tmdb_id'].unique()

# Create an empty list to store movie details
movies = []

# Loop over unique tmdb_id values and query the API for each movie
for tmdb_id in unique_tmdb_ids:
    #print(tmdb_id)
    url = f"https://api.themoviedb.org/3/movie/{tmdb_id}?api_key={api_key}&language=en-US"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        movie = data
        movies.append(movie)
    else:
        print(f"Error getting data for tmdb_id={tmdb_id}")

    # Extract all information for each movie
    movie = data
    movies.append(movie)

# Create a new dataframe with the movie details
movies_df = pd.DataFrame(movies)

# Rename the 'id' column to 'tmdb_id'
movies_df = movies_df.rename(columns={'id': 'tmdb_id'})

movies_df['tmdb_id'] = movies_df['tmdb_id'].astype(str).replace('\.0', '', regex=True)

# Merge the new dataframe with the original dataframe on 'tmdb_id'
merged_df = df.merge(movies_df, on='tmdb_id')

This brings many additional data fields into our DataFrame like budget, revenue, genre, and runtime.

Separating individual names into rows

One issue with the data is that there is either one name or multiple names associated with each nomination (see screenshot below).

In the screenshot above, we see that Triangle of Sadness was nominated for Best Picture, but there are three names associated with that nomination. We need to extract all of the names from the “name” column and create new rows for each name. We also want to exclude all of the extra words like, “Written by.”

Here is another great example of the strengths and weaknesses of ChatGPT: it suggested that we include “filmmaker history” as one of the features, and it generated the code that created this dataset, but it didn’t know that based on this dataset, we would need to break the name column into distinct names in order to build a “filmmaker history” feature. Also, when I suggested that we should extract names from this column, it suggested using commas to split the strings in each cell, which wouldn’t satisfy all cases. I suggested that we use NLP to extract names, and it gave me some options. We iterated on different packages and versions of code, and landed on this, which works pretty well:

from flair.models import SequenceTagger
from flair.data import Sentence

tagger = SequenceTagger.load('ner')

# define function to extract names from text
def extract_names(text):
    sentence = Sentence(text)
    tagger.predict(sentence)
    names = []
    for entity in sentence.get_spans('ner'):
        if entity.tag == 'PER':
            names.append(entity.text.strip())
    return names

# create a new dataframe to store the split names
new_df = pd.DataFrame(columns=df.columns)

# iterate over each row in the original dataframe
for index, row in df.iterrows():
    # check if the value in the 'credits' column is a string
    if isinstance(row['name'], str):
        # extract names from the 'credits' column using the extract_names function
        names = extract_names(row['name'])
        # iterate over each extracted name
        for name in names:
            # create a new row in the new dataframe for each extracted name
            new_row = row.copy()
            new_row['name'] = name
            new_df = new_df.append(new_row)
    else:
        # if the value is not a string, just append the row to the new dataframe
        new_df = new_df.append(row)

# reset the index of the new dataframe
new_df = new_df.reset_index(drop=True)

Counting previous wins and nominations for each nominee

Now that we have cleaned the dataset up, we want to include fields for each nominees’ previous nominations and wins. By previous nominations, I mean that for each nominee (a person) we want to know how many times they have been nominated for Oscars in previous years, and how many times they have won. This should cover the “Filmmaker History” factor that ChatGPT suggested that we include in our analysis.

# create dictionaries to store previous nominations and wins for each name
previous_nominations = {}
previous_wins = {}

# loop through each row in the dataframe
for index, row in df.iterrows():

    # extract the name and ceremony year from the current row
    name = row["name"]
    year = row["year_ceremony"]

    # check if the name has already appeared in previous ceremonies
    if name in previous_nominations:

        # count the number of previous nominations and wins for the current name
        previous_nom = sum([1 for x in previous_nominations.get(name, []) if x < year])
        previous_win = sum([1 for x in previous_wins.get(name, []) if x < year])


    else:
        # if the name hasn't appeared in previous ceremonies, set the counts to 0
        previous_nom = 0
        previous_win = 0

    # update the previous nominations and wins dictionaries for the current name and ceremony year
    previous_nominations.setdefault(name, []).append(year)
    if row["winner"] == True:
        previous_wins.setdefault(name, []).append(year)

    # assign the previous nomination and win counts to new columns in the dataframe
    df.at[index, "previous_nominations"] = previous_nom
    df.at[index, "previous_wins"] = previous_win

# convert the columns to integers
df["previous_nominations"] = df["previous_nominations"].astype(int)
df["previous_wins"] = df["previous_wins"].astype(int)

The code above goes through each nominee’s name and counts the number of times they have been nominated in the past and the number of times they have won in the past.

Total number of nominations for each film may also be relevant. Films with many nominations across categories historically tend to have a higher likelihood of winning Best Picture. This was not a feature that ChatGPT originally suggested that we include, but an observation of my own. I asked ChatGPT if this might be useful to include and it agreed.

To include this feature in our data, we use this code:

df['other_noms'] = df.groupby(['film','year_film'])['category'].transform('nunique')

Using OMDB to get critical and audience reception data

Now we need critical and audience reception data. While TMDB has a popularity field, we want audience reception from IMDb and critical reception from Rotten Tomatoes and Metacritic. The TMDB does give us the IMDb ID, which we can use to get IMDb, Rotten Tomatoes, and Metacritic data using the OMDB API.

# Set the OMDB API key
api_key = config.omdb_api_key

# Create an empty list to store the scores for each movie
scores = []

# Loop through each row in the dataframe and search for the movie using the OMDB API
for index, row in df.iterrows():
    # Extract the IMDB ID from the row
    imdb_id = row["imdb_id"]
    
    # Make a request to the API and parse the response as JSON
    response = requests.get(f"http://www.omdbapi.com/?i={imdb_id}&apikey={api_key}")
    data = response.json()
    
    # Extract the ratings from the response
    ratings = {"imdb_rating": None, "rotten_tomatoes_rating": None, "metacritic_rating": None}
    for rating in data.get("Ratings", []):
        if rating["Source"] == "Rotten Tomatoes":
            ratings["rotten_tomatoes_rating"] = rating["Value"]
        elif rating["Source"] == "Metacritic":
            ratings["metacritic_rating"] = rating["Value"]
        elif rating["Source"] == "Internet Movie Database":
            ratings["imdb_rating"] = rating["Value"]
    
    # Append the ratings to the scores list
    scores.append({
        "imdb_id": imdb_id,
        **ratings
    })

# Create a pandas dataframe from the list of scores
scores_df = pd.DataFrame(scores)

# Merge the scores dataframe with the original dataframe on imdb_id
df = pd.merge(df, scores_df, on="imdb_id", how="left")

You’ll need to get an API key from OMDB to use the API, and depending on how many queries you do, you may need to pay a small monthly fee.

Breaking categorical variables into independent features

We want to include the “genre” field in our analysis, but right now, each film has multiple genres associated with it — drama, romance, and action, for example. These genres are listed as dicts in the genre column for each film (see screenshot below).

We need to create new columns for each genre. Here is the code that ChatGPT generated to do this:

from ast import literal_eval

# convert the 'genres' column from a string to a list of dictionaries
df['genres'] = df['genres'].apply(literal_eval)

# create a set of all unique genre names
genre_set = set()
for genres in df['genres']:
    for genre in genres:
        genre_set.add(genre['name'])

# create a new column for each genre
for genre in genre_set:
    df[genre] = 0

# set the value to 1 for each genre that applies to the movie
for index, row in df.iterrows():
    genres = row['genres']
    for genre in genres:
        df.at[index, genre['name']] = 1

# drop the original 'genres' column
df.drop(columns=['genres'], inplace=True)

Likewise, there is a column called “original_language”. We want to create new columns for each potential language.

# create a set of all unique original language names
language_set = set(df['original_language'])

# create a new column for each original language
for language in language_set:
    df[language] = 0

# set the value to 1 for each original language that applies to the movie
for index, row in df.iterrows():
    language = row['original_language']
    df.at[index, language] = 1

# drop the original 'original_language' column
df.drop(columns=['original_language'], inplace=True)

After creating these genre columns, I wanted to explore some patterns historically. Not surprisingly, the genre with the most nominations and the most wins is Drama.

I also wanted to see the portion of nominees that actually end up winning. While Drama is the most nominated and winning category, Family and Music have the highest winning percentage given nominations at 25 percent each.

What does this mean for this year’s nominees? For one, no Science Fiction film has ever won Best Picture, and both Everything Everywhere All at Once and Avatar: The Way of Water are categorized as Science Fiction. The Music genre has a good track record of winning Best Picture when they are nominated (6 wins out of 24 nominations). Both Tár and Elvis are categorized as Music.

Rather than keeping all of the genres as their own features for the model, under ChatGPT’s guidance, we separated them into the major genres and “other”, defined as genres that have a frequency of less than 5 percent.

genre_columns = ['Documentary', 'Horror', 'War', 'Thriller', 'Comedy', 'Family',
                 'TV Movie', 'Music', 'Western', 'Adventure', 'Drama', 'Romance',
                 'Action', 'Science Fiction', 'Fantasy', 'History', 'Mystery',
                 'Animation', 'Crime']

# Calculate the frequency of each genre
genre_freq = df[genre_columns].sum() / len(df)

# Create a list of genres to be included in the "Other" category
other_genres = list(genre_freq[genre_freq < 0.05].index)

# Add up the columns of genres that are frequently occurring
frequent_genres = [col for col in genre_columns if col not in other_genres]
df['Frequent_Genres'] = df[frequent_genres].sum(axis=1)

# Set the value of the "Other" category to 1 if any of the infrequent genres are present
df['Other'] = df[other_genres].max(axis=1)

# Drop the original genre columns
df.drop(columns=genre_columns, inplace=True)

Building and training the model

Now we can finally build and train the model. When I asked ChatGPT which model we should use, it responded:

There are several models that can be used to predict a binary outcome, including:

Logistic Regression: This is a widely used model for binary classification problems. It models the probability of the binary outcome based on the input features and uses a logistic function to convert the output to a probability.

Decision Trees: Decision trees can be used for binary classification problems as well. They work by recursively splitting the dataset based on the input features until a certain stopping criterion is met.

Random Forest: Random forests are an extension of decision trees, where multiple trees are built and combined to improve the accuracy and prevent overfitting.

Support Vector Machines (SVM): SVM is a powerful model that can be used for binary classification problems. It finds the best hyperplane that separates the two classes with the maximum margin.

Neural Networks: Neural networks can be used for binary classification problems as well. A neural network can have a single output node with a sigmoid activation function, which outputs a probability estimate.

The choice of the model depends on various factors such as the size of the dataset, the complexity of the problem, the interpretability of the model, and the available computational resources. It is always a good practice to try different models and compare their performance on the validation set before selecting the best one.

I decided to just use a logistic regression model. We split the data into a training data set and a test data set, using an 80–20 split. Then we separated films made in 2022 because that is what we are going to try to predict. We also used recursive feature elimination with cross-validation to limit the total number of features in our model. Here is the code that ChatGPT generated, with significant guidance and iteration from yours truly:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Define the feature columns
feature_cols = ['budget', 'revenue', 'runtime', 
                'previous_nominations','previous_wins', 'imdb_rating', 
                'rotten_tomatoes_rating', 'metacritic_rating',
               'Frequent_Genres', 'genre', 'english_language']

# Define the target column
target_col = 'winner'

# Define a function to convert the rating strings to float values
def convert_rating(x):
    if pd.isnull(x):
        return x
    elif isinstance(x, str):
        parts = x.split("/")
        if len(parts) == 1:
            return float(x.strip('%')) / 10
        elif len(parts) == 2:
            return float(parts[0]) / float(parts[1])
    else:
        return x


# Apply the function to the rating columns
df['imdb_rating'] = df['imdb_rating'].apply(convert_rating)
df['rotten_tomatoes_rating'] = df['rotten_tomatoes_rating'].apply(convert_rating)
df['metacritic_rating'] = df['metacritic_rating'].apply(convert_rating)

# Filter out the 2022 data for prediction
predict_df = df[df['year_film'] == 2022].dropna(subset=feature_cols)

# Split the data into training and test sets for all previous years
df = df.dropna(subset=feature_cols + [target_col])
train_df, test_df = train_test_split(df[df['year_film'] < 2022].
                                     dropna(subset=feature_cols + [target_col]), test_size=0.2, random_state=42, stratify=df[df['year_film'] < 2022][target_col])


train_df["winner"] = train_df["winner"].apply(lambda x: 1 if x == True else 0)
test_df["winner"] = test_df["winner"].apply(lambda x: 1 if x == False else 0)


# Scale the features
scaler = StandardScaler()
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])
test_df[feature_cols] = scaler.transform(test_df[feature_cols])
predict_df[feature_cols] = scaler.transform(predict_df[feature_cols])

# Recursive feature elimination with cross-validation
estimator = LogisticRegression(multi_class='multinomial', solver='lbfgs')
selector = RFECV(estimator, step=1, cv=100)
selector = selector.fit(train_df[feature_cols], train_df[target_col])

# Select the top features
selected_features = [f for f, s in zip(feature_cols, selector.support_) if s]
print(selected_features)

# Train the model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(train_df[selected_features], train_df[target_col])

# Predict the test set and calculate the accuracy
test_df['predicted_winner'] = model.predict(test_df[selected_features])
test_df['predicted_prob'] = model.predict_proba(test_df[selected_features])[:,1]
predict_df['predicted_prob'] = model.predict_proba(predict_df[selected_features])[:,1]


# Convert the predicted probabilities to binary predictions using a threshold value of 0.5
test_df['predicted_winner_binary'] = test_df['predicted_prob'].apply(lambda x: 1 if x >= 0.05 else 0)

# Calculate the accuracy, precision, recall, and F1 score based on the binary predictions
accuracy = accuracy_score(test_df['winner'], test_df['predicted_winner_binary'])
precision = precision_score(test_df['winner'], test_df['predicted_winner_binary'])
recall = recall_score(test_df['winner'], test_df['predicted_winner_binary'])
f1 = f1_score(test_df['winner'], test_df['predicted_winner_binary'])

# Print the results
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 score: {f1:.2f}")

The code above, not surprisingly, took the most time to write. ChatGPT was able to immediately generate a piece of code that created a logistic regression model for the data, but it wasn’t perfect. The “ratings” columns, for example, hadn’t yet been converted to float objects, so they couldn’t be used in the model. There was no recursive feature elimination in the first version either, it just built a model that used every variable. There was no accuracy, precision, recall or F1 scores in the first version; I had to ask ChatGPT how we could validate our model.

The features selected for the model are: budget, revenue, runtime, previous_nominations, previous_wins, imdb_rating, metacritic_rating, genre, other_noms, and english_language. Overall, the model performs relatively well with an accuracy of 0.45, precision of 0.72, recall of 0.54, and an F1 score of 0.62.

I asked ChatGPT to interpret these results. It described what each of these measures represents and then said:

In summary, the model has a low accuracy, but a high precision, indicating that when it predicts a movie will win Best Picture, it is often correct. However, the model’s recall is relatively low, meaning that it does not identify all of the actual Best Picture winners in the test set. The F1 score is also relatively low, suggesting that the model’s performance is moderate overall.

To see the coefficients associated with each feature we use this code:

# Print the coefficients and their corresponding feature names
for feature, coef in zip(selected_features, model.coef_[0]):
    print(f"{feature}: {coef:.2f}")

The coefficients generally make intuitive sense to me. Revenue, runtime, previous_wins, imdb_rating, metacritic_rating, other_noms, and english_language are all positively correlated with winning, while budget, previous_nominations, and genre are all negatively correlated.

Why is previous_nominations negatively correlated with winning? I asked ChatGPT to help me understand these coefficients:

budget: Negative (-0.58). This means that as the budget of a movie increases, the likelihood of it winning Best Picture decreases. This may be because bigger budget movies are often seen as more commercial and less “artistic”, which is what the Academy tends to reward.

genre: Negative (-0.23). This means that certain genres are less likely to win Best Picture than others. It’s possible that the Academy has certain biases or preferences when it comes to genres.

revenue: Positive (0.26). This means that as the revenue of a movie increases, the likelihood of it winning Best Picture increases. This could be because movies that are financially successful are often seen as more popular and more culturally significant.

runtime: Positive (0.10). This means that as the runtime of a movie increases, the likelihood of it winning Best Picture increases. This could be because longer movies are often seen as more ambitious and more artistically significant.

previous_nominations: Negative (-0.28). This means that as the number of previous nominations a movie has received increases, the likelihood of it winning Best Picture decreases. This may be because the Academy tends to spread the awards around and reward new films or filmmakers rather than ones that have already been recognized in the past.

previous_wins: Positive (0.10). This means that as the number of previous wins a movie has received increases, the likelihood of it winning Best Picture increases. This could be because the Academy tends to reward films or filmmakers that have a history of success.

imdb_rating: Positive (0.25). This means that as the IMDB rating of a movie increases, the likelihood of it winning Best Picture increases. This could be because high IMDB ratings indicate popular and critically acclaimed movies.

metacritic_rating: Positive (0.11). This means that as the Metacritic rating of a movie increases, the likelihood of it winning Best Picture increases. This could be because high Metacritic ratings indicate critically acclaimed movies.

english_language: Positive (0.12). This means that movies in English are more likely to win Best Picture than movies in other languages. This may be because the majority of Academy voters are English-speaking and may have biases towards movies in their own language.

other_noms: Positive (0.56). This means that as the number of nominations a movie receives in other categories increases, the likelihood of it winning Best Picture increases. This makes sense given what we discussed earlier about the correlation between total nominations and Best Picture wins.

To me, the response above is very helpful. It may or may not be “correct”, but as a data scientist trying to make sense of coefficients, having potential explanations is useful to think through. These explanations may lead to more questions, more data exploration, and model refinement, which will make for more thorough overall analysis.

Make predictions and analyze results

Now we want to make predictions for this year’s Oscars and view the contribution of each of these features towards the probability of winning.

# Create a table with the features, coefficients, and contribution to the probability of winning
selected_columns = ['name', 'film', 'category'] + selected_features
predict_df = predict_df[selected_columns].copy()

coef = pd.Series(model.coef_[0], index=selected_features)
predict_df["predicted_winner"] = model.predict(predict_df[selected_features])
predict_df["predicted_prob"] = model.predict_proba(predict_df[selected_features])[:, 1]

contributions = []
for idx, row in predict_df.iterrows():
    name = row['name']
    film = row['film']
    category = row['category']
    for feature, coef_val in coef.iteritems():
        feature_value = row[feature]
        contribution = feature_value * coef_val
        contributions.append([name, film, category, feature, coef_val, feature_value, contribution, row['predicted_prob']])

contributions_df = pd.DataFrame(contributions, columns=['name', 'film', 'category', 'feature', 'coefficient', 'feature_value', 'contribution', 'predicted_prob'])

contributions_df = pd.DataFrame(contributions_df.
             groupby(['film','feature'])['predicted_prob','contribution'].mean()).sort_values(by='predicted_prob',ascending=False)

contributions_df = contributions_df.reset_index()

Here are final results:

These numbers indicate the probability of winning for each of the films. Everything Everywhere All at Once (EEAAO), has the highest chance of winning at 26.9 percent, followed by The Banshees of Inisherin at 25.3 percent. It should be noted that the accuracy, recall, precision and F1 scores listed above were created using a threshold of 5 percent, meaning that there are actually four films with a non-negligible chance of winning: EEAAO, The Banshees of Inisherin, Tár, and All Quiet on the Western Front.

Now we have a DataFrame of all the films, their probability of winning Best Picture, and the relative contribution of each of the features. To view this data in a more intuitive way, we made a heat map:

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import SymLogNorm

# pivot the dataframe to get the contributions by feature for each film
contributions_by_film = contributions_df.pivot(index='film', columns='feature', values='contribution')

# sort the films by predicted_prob in descending order
contributions_by_film = contributions_by_film.loc[contributions_df.sort_values('predicted_prob', ascending=False)['film'].unique()]

# create a heatmap
fig, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(10, 250, s=75, l=40, n=9, center='light', as_cmap=True)
norm = SymLogNorm(linthresh=0.03, linscale=0.03, vmin=-1, vmax=1)
sns.heatmap(contributions_by_film, cmap=cmap, annot=True, fmt='.2f', linewidths=.5, norm=norm)
ax.set_title('Contributions by Feature for Each Film')
plt.show()

The films are ranked with the highest probability of winning at the top. The columns correspond to the different features we have used in the model. The value and color of each cell indicates the contribution towards the probability of winning of each of the factors to each of the films.

Before we try to interpret these findings ourselves, let’s hear ChatGPT’s interpretation:

Firstly, the number of previous nominations and wins a film has received is positively correlated with its probability of winning. This could explain why Everything Everywhere All at Once and The Banshees of Inisherin are ranked the highest, as both films have received numerous nominations in our data.

Secondly, the number of other nominations a film has received is also positively correlated with its probability of winning. This could explain why All Quiet on the Western Front is ranked higher than The Fabelmans and Elvis, as it has received more nominations in our data.

Thirdly, the revenue a film generates is positively correlated with its probability of winning. This could explain why Tár is ranked higher than The Fabelmans, Elvis, Top Gun: Maverick, Triangle of Sadness, and Women Talking, as it has generated more revenue in our data.

Lastly, some factors may be more difficult to interpret, such as the negative correlation between budget and winning. It’s possible that higher budget films are perceived as being more commercial or less artistic, leading to a lower probability of winning in some cases.

Again, I find this interpretation insightful both because of what it seems to get right and because of its inaccuracies. Firstly, the number of previous nominations is at the person level — a film cannot have been nominated in previous years. Also, the number of previous nominations is actually negatively correlated with winning, so that explanation doesn’t make sense. The second point is true — the number of nominations a film has is positively correlated with winning Best Picture and All Quiet on the Western Front has more nominations than almost any other movie (other than EEAAO). Third, while it is true that revenue is positively correlated with the probability of winning, that is not why Tár is ranked higher than Top Gun: Maverick.

Top Gun is one of the highest grossing films of the year. It made nearly 1.5 billion USD — 8.75 times its budget and as Steven Spielberg told Tom Cruise, “You saved Hollywood’s ass. And, you might have saved theatrical distribution. Seriously. Maverick might have saved the entire theatrical industry.” The only movie on this list that had a higher revenue was Avatar: The Way of Water, with a total revenue of nearly 2.3 billion USD. But the budget on Avatar was 460 million USD, so the return was only about five times the budget.

In another response, ChatGPT suggested that Tár is ranked so high because it is a documentary and documentaries are more likely to win Best Picture. But Tár is not a documentary. It is about the fictional orchestra conductor Lydia Tár. And even if it were a documentary, no documentary has ever won Best Picture at the Oscars, so that interpretation doesn’t make sense.

The real reason EEAAO is ranked number one is because it has received 11 total nominations at this year’s Academy Awards. EEAAO is also incredibly popular according based on its IMDb rating, second only to Top Gun: Maverick. It has a relatively high Metacritic rating (fourth highest of the nominees) and the producers have never received any Oscar nominations in the past — remember that as the number of previous nominations increases, the likelihood of winning decreases.

The big factor working against EEAAO is that it is categorized as science fiction, and historically, no science fiction film has ever won Best Picture. In fact, only eight films categorized by one of the less frequent genres have ever won Best Picture. If you are curious, here is a list of those eight films. It is worth noting that The Shape of Water, often considered a science fiction film, is actually classified as fantasy according to TMDB.

The Banshees of Inisherin is ranked number two because it has nine total nominations, is relatively low budget (20 million USD), it has a pretty high IMDb rating, and a very high Metacritic rating (second only to Tár). The relatively short runtime (114 minutes) and low revenue (41 million USD) lowered its overall probability some.

All Quiet on the Western Front is third because it also received nine nominations, is also lower budget (also 20 million USD), was made by previously unrecognized producers, and also has a pretty high IMDb rating. But it is in German and only two non-english films have ever won Best Picture: Parasite and The Artist.

Rounding out the fourth spot is Tár. Tár has the highest Metacritic score of all the nominees but only has six total nominations. However, CODA won Best Picture last year with only 3 nominations. If we exclude the ‘other nominations’ data Tár is the most likely to win.

All four films above have a non-negligible chance of winning. The rest are not critically acclaimed enough (Elvis, Triangle of Sadness, Avatar: The Way of Water), are not popular enough (Elvis, Triangle, Women Talking), are too high budget (Elvis, Top Gun, Avatar, The Fabelmans, Women Talking), do not have high enough revenue (Fabelmans, Triangle, Women Talking), too few nominations (Top Gun, Triangle, Women Talking, Avatar), or the producers have already been significantly recognized (Fabelmans).

Out of curiosity, I ran the model against last year’s nominees, and here are the results:

It would have predicted that The Power of the Dog would win, and that CODA, the film that did end up winning, would have almost no chance of winning. That is largely because CODA had only three total nominations, compared to the twelve nominations for The Power of the Dog.

Since the total number of nominations has such a big factor on the results, I ran the model again without this feature. In that version of the model, CODA was predicted to win last year, and Tár is predicted to win this year.

Conclusion

ChatGPT is an incredible tool for a data scientist but there are definitely some shortcomings. It can’t take over the work entirely — there still needs to be a human guiding the analysis, for now.

For the final predictions, it is important to remember the factors that we are not including in this model: industry buzz (early reviews, other awards, etc.), political and social issues, and marketing and distribution. It is impossible to know what effect those factors would have on this model without including the data.

In terms of other awards, The Fabelmans won the Golden Globe for Drama Motion Picture, The Banshees of Inisherin won the Golden Globe for Musical or Comedy Motion Picture, All Quiet on the Western Front won the BAFTA for Best Film, and EEAAO won Outstanding Cast.

Personally, I, as a human, think it would not be surprising if any of the top four films (EEAAO, Banshees, All Quiet, or Tár) won.