An NLP Movie History: Plotting the ebbs and flows of cinema trends

  • 20 April 2022


The Liberty Theater in New Orleans, Louisiana, ca. 1936. From The New York Public Library


Are superhero films going to fall out of fashion? Marvel fans might feel skeptical about ever seeing the likes of Black Widow or Thor bomb at the box office—and their attitude is certainly justified by the overwhelming success of movies like Spider-Man: No Way Home or The Batman. Still, cinema history is full of twists and turns: genres that have risen to popularity, remained in the spotlight for a while, and then quietly left the stage. Think, for instance, of the witty screwball comedies of the 1930s, the intricate noirs of the 1940s, or the sensational disaster movies of the 1970s.

Can data science help us make sense of these sudden changes in cinematic taste, and perhaps predict what kinds of movies might soon be coming to—or departing from—a screen near us?

To answer this question, we can turn to Natural Language Processing—a surprisingly helpful strategy for this sort of analysis. In this article, I will use NL API—an API that returns a suite of NLP analytics—to explore how the language used to describe movies changes depending on the year when the movie was produced.

In what follows, I will show you how to create plots such as the one you see above. This line chart shows the changing popularity of “Music” as a topic in film synopses, with bumps that neatly align with the so-called Golden Era of musical film (1930-1950), with the brief revival of Broadway adaptations heralded by West Side Story (1961), and with the contemporary resurgence of the genre inaugurated by Moulin Rouge! (2001) and Chicago (2002).

The data and the code that I have used to generate this plot and the ones below are available in this Jupyter Notebook. Please feel free to follow along.


1. The Setup


1.1. The Dataset

We’ve all read a movie plot before: a more or less detailed recap of a film, written down as part of a review, a magazine feature, or a Wikipedia article. To begin this exploration of movie trends with NLP, I have used a Kaggle dataset containing about 35,000 movie plots taken from Wikipedia. These synopses are arranged in a .csv file that also lists the year of release, the country of origin, and the genre of each movie—from the one-minute-long silent satire Kansas Saloon Smashers (1901) to Ferzan Özpetek’s Turkish-Italian travelogue Red Istanbul (2017).

For the sake of consistency, I have decided to consider only the 17,377 plots tagged as “American,” the largest share of movies represented in the dataset. Similarly, since the first three decades represented in this dataset comprise—on average—significantly fewer plots than the subsequent ones, I have picked only movies produced between 1930 and 2017.
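The two filters described above can be sketched as follows. This is a dependency-free illustration built on the standard library's csv module, not the notebook's own code; in a pandas workflow the same conditions would be a boolean mask on the DataFrame. The column name "Origin/Ethnicity" and the helper name `filter_plots` are assumptions, and the rows are made up for the demonstration.

```python
import csv
import io

def filter_plots(rows, origin="American", first_year=1930, last_year=2017):
    """Keep rows with the given origin released within [first_year, last_year]."""
    kept = []
    for row in rows:
        # "Origin/Ethnicity" is an assumed column name; adjust it to your CSV header.
        if row["Origin/Ethnicity"] != origin:
            continue
        if first_year <= int(row["Release Year"]) <= last_year:
            kept.append(row)
    return kept

# Tiny in-memory stand-in for the real .csv file (made-up rows).
sample_csv = io.StringIO(
    "Release Year,Title,Origin/Ethnicity,Plot\n"
    "1901,Film A,American,Plot of film A.\n"
    "1950,Film B,American,Plot of film B.\n"
    "1950,Film C,British,Plot of film C.\n"
    "2017,Film D,American,Plot of film D.\n"
)
selected = filter_plots(csv.DictReader(sample_csv))
print([row["Title"] for row in selected])  # ['Film B', 'Film D']
```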

1.2. The Tool

Since I intend to examine several features at once (including, for instance, a list of predominant lemmas and topics, a breakdown of the main concepts and feelings detected in each movie plot, and a sentiment analysis score), I have decided to use NL API for this project. This suite of text processing tools can be invoked from within a Jupyter Notebook in order to receive back a set of analytics, which can then be manipulated and visualized using current Python libraries.

1.3. Preprocessing the Dataset

We can start by grouping together the plots by year of release. During this step, we should also perform a quick cleanup of the texts themselves, removing all the strings between round and square brackets (generally used for footnote references or for information that is not relevant to the plot itself).

def group_plots_by_year(df):
    # Collect the cleaned plots of each release year into a list
    return df.groupby("Release Year")["Plot"].apply(
        lambda x: [remove_brackets(plot) for plot in x.tolist()])
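The `remove_brackets` helper called above is not shown in the article. A minimal sketch of what such a cleanup might look like, assuming a regex-based approach (the pattern and the whitespace tidying are my own choices, not necessarily the notebook's):

```python
import re

# Matches "(...)" or "[...]" spans that contain no nested brackets.
BRACKETED = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")

def remove_brackets(text):
    """Drop bracketed spans, then tidy the leftover whitespace."""
    previous = None
    # Repeat so that nested brackets are peeled off from the inside out.
    while previous != text:
        previous = text
        text = BRACKETED.sub("", text)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s{2,}", " ", text).strip()

print(remove_brackets("Jane (a spy) leaves town.[1] She returns [citation needed] later."))
# Jane leaves town. She returns later.
```

Note that this version also collapses line breaks into single spaces, which is usually harmless for text that is about to be sampled and sent to an API.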


2. The Analysis


2.1. Invoking the API

To call the NL API on your data you will need to sign up on the Developer website (the tool is free to use, and the sign-up process is a matter of seconds). I recommend storing your credentials in a .env file and using Python-dotenv to load them into your environment. Once you have done so, processing a text and retrieving the analytics you need takes only a few lines of code.

For instance, the first of the following two functions will allow you to perform a full analysis on a chunk of text (including semantic and morphological parsing, extractions of main tokens and entities, sentiment analysis, and so on). The second function will perform a classification analysis according to one of the four taxonomies that are available through the API (“IPTC Media Topics” is the default one here, but we will also perform an “emotional traits” analysis in this experiment).

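from dotenv import load_dotenv
# ExpertAiClient is provided by the expertai-nlapi package
from expertai.nlapi.cloud.client import ExpertAiClient

# Load the API credentials stored in the .env file into the environment
load_dotenv()


def nl_api_full(string, language="en"):
    # Full analysis: parsing, main tokens and entities, sentiment, and so on
    client = ExpertAiClient()
    api_obj = client.full_analysis(body={"document": {"text": string}},
                                   params={"language": language})
    return api_obj

def nl_api_classification(string, taxonomy, language="en"):
    # Classification of the text against the chosen taxonomy
    client = ExpertAiClient()
    api_obj = client.classification(body={"document": {"text": string}},
                                    params={"taxonomy": taxonomy,
                                            "language": language})
    return api_obj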

Processing a single chunk of text to retrieve these statistics takes relatively little time; however, doing so for the entire dataset would be extremely time-consuming. To speed up the process, I have created a set of functions that process random samples of text from the dataset. For each year of the range that interests us, the functions randomly select 100 movie plots and then extract a random 500-character sample from each one of these plots. By using this method, I have processed samples from 8,800 movie plots—about 53% of the total available for this analysis.
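The sampling step described above could look roughly like this. It is a sketch rather than the notebook's implementation: the name `plot_sampler` matches the call in the next function, but the parameters `n_plots`, `sample_length`, and `seed` are hypothetical additions for illustration.

```python
import random

def plot_sampler(plots, n_plots=100, sample_length=500, seed=None):
    """Pick up to n_plots plots at random and cut a random excerpt from each."""
    rng = random.Random(seed)
    chosen = rng.sample(plots, min(n_plots, len(plots)))
    samples = []
    for plot in chosen:
        # Plots shorter than sample_length are kept whole.
        last_start = max(len(plot) - sample_length, 0)
        start = rng.randint(0, last_start)
        samples.append(plot[start:start + sample_length])
    return samples

# Made-up plots, only to show the shape of the output.
plots = [str(i) * 400 for i in range(150)]
samples = plot_sampler(plots, seed=0)
print(len(samples), max(len(s) for s in samples))  # 100 500
```

Cutting by character position means that an excerpt may start or end in the middle of a word; snapping the boundaries to whitespace would be a straightforward refinement.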

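def process_multiple_samples(year, plots_by_year):
    full_api_objs = []
    iptc_api_objs = []
    emotional_api_objs = []
    samples_list = plot_sampler(plots_by_year[year])
    for a_sample in samples_list:
        # One call per analysis type; the taxonomy identifiers are assumed
        # to be "iptc" and "emotional-traits"
        full_api_obj_for_sample = nl_api_full(a_sample)
        iptc_api_obj_for_sample = nl_api_classification(a_sample, "iptc")
        emotional_api_obj_for_sample = nl_api_classification(a_sample,
                                                             "emotional-traits")
        full_api_objs.append(full_api_obj_for_sample)
        iptc_api_objs.append(iptc_api_obj_for_sample)
        emotional_api_objs.append(emotional_api_obj_for_sample)
    return full_api_objs, iptc_api_objs, emotional_api_objs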

This function sends each sample to the API three times: once for the full analysis, once for the IPTC topics classification, and once for the emotional traits classification. The 300 JSON objects returned by the API for each year are stored in three separate lists and then collected in a dedicated dictionary.
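The per-year dictionary mentioned above might be assembled along these lines. The function name `process_all_years` and the keys "full", "iptc", and "emotional" are hypothetical; the processing function is passed in as an argument so that the sketch can be tried with a stub instead of real API calls.

```python
def process_all_years(plots_by_year, process_year, first_year=1930, last_year=2017):
    """Run process_year for every year in the range and store the results by year."""
    data_by_year = {}
    for year in range(first_year, last_year + 1):
        if year not in plots_by_year:
            continue  # skip years with no plots
        full, iptc, emotional = process_year(year, plots_by_year)
        data_by_year[year] = {"full": full, "iptc": iptc, "emotional": emotional}
    return data_by_year

def fake_process(year, plots_by_year):
    """Stub standing in for process_multiple_samples (no network access needed)."""
    n = len(plots_by_year[year])
    return ["full"] * n, ["iptc"] * n, ["emotional"] * n

data = process_all_years({1930: ["a", "b"], 1932: ["c"]}, fake_process, 1930, 1932)
print(sorted(data))  # [1930, 1932]
```

In real use, `process_multiple_samples` would take the place of `fake_process`.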

2.2. Extracting the Data

We are mainly interested in six of the many features returned by NL API: the main lemmas, the main concepts (or syncons), the main topics, the sentiment analysis score, the IPTC media topics classification, and the emotional traits classification. A simple set of functions can help us extract all these features from the JSON objects returned by the API. For instance, the following two functions extract the sentiment score and the list of main lemmas associated with each plot sample.

def extract_sentiment_score(full_api_obj):
    # Overall sentiment score of the analyzed sample
    sentiment = full_api_obj.sentiment.overall
    return sentiment

def extract_scores_by_lemma(full_api_obj):
    # Map each main lemma to its relevance score
    scores_by_lemma = {}
    for lemma in full_api_obj.main_lemmas:
        scores_by_lemma[lemma.value] = lemma.score
    return scores_by_lemma
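To follow a single lemma over time, the per-sample dictionaries returned by `extract_scores_by_lemma` have to be reduced to one value per year. The article does not show how this is done; the sketch below assumes a plain average in which samples that do not mention the token count as 0. The function names are hypothetical.

```python
from statistics import mean

def token_score_for_year(scores_per_sample, token):
    """Average a token's score over one year's samples (0 when it is absent)."""
    return mean(scores.get(token, 0) for scores in scores_per_sample)

def token_scores_by_year(scores_by_year, token):
    """Map each year to the token's average score, in chronological order."""
    return {year: token_score_for_year(scores_by_year[year], token)
            for year in sorted(scores_by_year)}

# Made-up scores for two years.
scores_by_year = {
    1931: [{"horse": 10.0}, {"car": 4.0}],
    1930: [{"horse": 20.0, "car": 2.0}],
}
print(token_scores_by_year(scores_by_year, "horse"))  # {1930: 20.0, 1931: 5.0}
```

Counting absent tokens as 0 makes the yearly value reflect both how often and how strongly a lemma appears; averaging only over the samples that contain it would answer a different question.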

2.3. Plotting the Data

With all the data collected and aggregated by year, we can now proceed to plotting these values onto a chart to discover how movie plots might have changed over time.
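As one example of aggregating by year, the sentiment scores of a year's samples can be reduced to their median, the statistic used for the sentiment chart in section 3.4. A minimal sketch, assuming the scores have already been extracted into one list per year (the function name and the input layout are hypothetical):

```python
from statistics import median

def median_sentiment_by_year(sentiments_by_year):
    """Map each year to the median sentiment of its samples, in chronological order."""
    return {year: median(sentiments_by_year[year])
            for year in sorted(sentiments_by_year)}

# Made-up sentiment scores for two years.
sentiments_by_year = {1931: [0.5, -3.0, 2.0], 1930: [4.0, -2.0]}
print(median_sentiment_by_year(sentiments_by_year))  # {1930: 1.0, 1931: 0.5}
```

The median is less sensitive than the mean to a handful of extremely positive or negative samples, which matters when the scores vary widely.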

One should of course keep in mind that these values do not refer to the movies themselves, nor to the actual moment in which they were released or produced. The results we see are derived from texts written about these movies, but at different times and by different people; so, they should be taken with a grain of salt. Still, albeit indirectly, they might give us an idea of how the overall moods, themes, and subjects of the films they describe might have changed.

I have created three functions to visualize the different types of features returned by NL API: the scores of specific lemmas, topics, or concepts; the number of unique lemmas, topics, or concepts; and the sentiment analysis score. To increase the readability of the resulting graphs, and to minimize the impact of any potential outliers, I have used SciPy’s UnivariateSpline class to smooth these values: the values are scaled by their maximum, fitted with a spline, and scaled back. The resulting charts show both the raw values and the smoothed ones.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.interpolate import UnivariateSpline

def smoothen_values(x, y, interval, s_factor):
    x = np.array(x)
    y = np.array(y)
    # Fit the spline on values scaled by their maximum, then scale back
    spline = UnivariateSpline(x, y/np.max(y), s=s_factor)
    smooth_x = np.linspace(x.min(), x.max(), interval)
    smooth_y = spline(smooth_x)*np.max(y)
    return smooth_x, smooth_y

def draw_plot(y_data, years, color_1, color_2, plot_title,
              add_fill, interval, s_factor):
    ax = plt.subplot()
    # Raw yearly values
    sns.lineplot(x=years, y=y_data, color=color_1, ax=ax)
    smooth_years, smooth_data = smoothen_values(years, y_data,
                                                interval, s_factor)
    # Smoothed trend line
    sns.lineplot(x=smooth_years, y=smooth_data, linewidth=4, color=color_2, ax=ax)
    if add_fill:
        plt.fill_between(smooth_years, smooth_data, alpha=0.2, color=color_2)
    ax.set_title(plot_title)
    plt.show()

# Parameter list reconstructed from the names used in the body
def plot_scores(data_by_year, token, token_type, color_1, color_2,
                add_fill, interval, s_factor, plot_title=""):
    if plot_title == "":
        plot_title = f'Scores for "{token}" ({token_type.capitalize()})'
    scores = extract_token_scores_by_year(data_by_year, token, token_type)
    years = list(data_by_year.keys())
    draw_plot(scores, years, color_1, color_2, plot_title,
              add_fill, interval, s_factor)


3. The Results


3.1. From Horses to Space Shuttles

Some of the plots generated by these functions show trends that can be explained fairly easily.

In the chart above, for instance, we see how the syncon for “horse” tends to become less and less frequent in the dataset, echoing the declining success of westerns—and the increasing predominance of automobiles—throughout the decades.

Other plots highlight how the success of particular movies and franchises might have determined the emergence or revival of certain topics and settings.

In the chart above, we can see how high schools remained a relatively unpopular topic until the early 1980s, when Porky’s (1981) launched the genre of the raunchy teen comedy. The “high school” movie seems to have peaked again in the late 1990s (after American Pie, 1999) and then between the late 2000s and early 2010s (after High School Musical, 2006).

Other trends might have been impacted by historical events.

In the chart above, for instance, we can see how the topic of “astronautics” rose in popularity in the years leading to the first Moon landing of 1969, and then again in the late 1980s—following the Challenger disaster of 1986.

3.2. The Marriageless Plots

Other charts are not as easy to interpret.

In the chart above, for instance, we see that the topic of “marriage” has gradually become less prominent in movie plots, particularly between the 1930s and the 1980s. In the 1990s the topic seems to enjoy a brief resurgence, perhaps led by the success of the marriage-centric comedy When Harry Met Sally... (1989).

Even more surprisingly, the chart above tracks the disappearance of “Love” from our screens. Detected by the API in many of these movie plots, this emotional trait seems to have become less prevalent throughout the decades—and especially after the 1960s. What, if anything, has made “love” and “marriage” less appealing to filmmakers?

3.3. Have US movies gotten boring?

We can also consider the numbers of unique tokens recorded for each year. This metric might give us an idea of whether the variety of movie subjects has increased or decreased over time. Do movies deal with more or with fewer topics and concepts year after year?
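Counting unique tokens for a year amounts to pooling the tokens of all its samples into a set. A minimal sketch, assuming each sample has already been reduced to a list of token strings (the function name and the data are illustrative):

```python
def count_unique_tokens(tokens_per_sample):
    """Count the distinct tokens found across one year's samples."""
    unique = set()
    for tokens in tokens_per_sample:
        unique.update(tokens)
    return len(unique)

# Made-up tokens for two years.
tokens_by_year = {
    1930: [["horse", "ranch"], ["horse", "sheriff"]],
    1975: [["shark", "beach"], ["spaceship", "robot"], ["beach"]],
}
unique_by_year = {year: count_unique_tokens(samples)
                  for year, samples in sorted(tokens_by_year.items())}
print(unique_by_year)  # {1930: 3, 1975: 4}
```

Because every year contributes the same number of samples of the same length, these counts can be compared across years without further normalization.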

The chart you see above shows a clear pattern. The number of unique syncons in the dataset grows steadily: movies produced in the 1950s are associated with far more concepts than those produced in the 1930s. This trend continues until the mid-1970s before plateauing—and even decreasing a little—between the 1980s and the 2010s. If these short texts are in any way representative of the movies they recount, we might say that the variety of themes, characters, and circumstances brought to the screen steadily increased between 1930 and 1975, but that this increase has since slowed down. Have movies produced in the US become less varied or adventurous over the past few decades?

3.4. A Sentimental Rollercoaster

The last result that I want to share is perhaps the most puzzling.

The chart above shows the median sentiment analysis scores for each yearly collection of samples processed by the API. This score mirrors the alleged negativity or positivity of a text: negative assessments are associated with lower scores, and positive statements with higher ones. The values in this dataset vary greatly, and the shape of the smoothed line might be an artifact of the normalization and smoothing process.

Nevertheless, two trends stand out. First, the median scores seem to fall and rise almost periodically, with drops roughly every twenty years: in the mid-1930s and mid-1950s, and then in the early 1970s and early 1990s. Second, the general pattern seems to be that of a curve, with 2010s movies returning to the same alleged positivity as 1930s movies. Have movies generally become more lighthearted at the turn of the century? And are we in for a new wave of gloomy films?


* * *


What’s the next big thing in film history? Are the days of the superhero blockbuster numbered? And should we expect some other old genre to resurface: a new take on swashbuckler films, perhaps, or the long-awaited return of the Universal Classic Monsters? Some of these plots give us an idea of where movies might be headed and remind us that, whatever the case, “it’s going to be a bumpy night.”
