Projects

Selected work from my data science coursework and personal projects.

100 Years of Cinema

Analyzing 1,000 top-rated films from a public IMDb dataset to surface trends in runtime, recency bias, and directorial consistency.

Overview

Using a public Kaggle dataset of IMDb's top 1,000 films, I built a small end-to-end data pipeline: cleaning the raw CSV with Python, loading it into a SQLite database, writing SQL to surface insights, and visualizing the findings with matplotlib. Three findings stood out.

Tools: Python (pandas, sqlite3, matplotlib), SQL

Finding 1: Movies have gotten dramatically longer

[Figure: Bar chart showing average movie runtime increasing from 86 minutes in the 1920s to 128 minutes in the 2010s]

Average runtime climbed from 86 minutes in the 1920s to 128 minutes in the 2010s, an increase of roughly 49% over a century. The biggest jump came between the silent era (1920s) and the studio era (1930s–1950s), once sound and longer-form storytelling became the norm. Modern epics and franchise films have kept runtimes near the two-hour mark.
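
As a quick check on that percentage, the relative change can be computed directly from the two decade averages quoted above. This is a minimal sketch; the helper name `percent_change` is illustrative and not part of the project code.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# Decade averages quoted in the text: 1920s vs. 2010s runtime in minutes
increase = percent_change(86, 128)
print(f"{increase:.1f}%")  # 48.8%
```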

Finding 2: "Best of" lists have a strong recency bias

[Figure: Bar chart showing the number of films in IMDb's top 1000 by decade, with massive spikes in the 1990s, 2000s, and 2010s]

Just 11 films from the 1920s made the top 1,000, versus 242 from the 2010s. Yet the average rating barely shifts across decades (8.13 in the 1920s vs. 7.92 in the 2010s). The takeaway: "best of all time" lists reflect who is voting today as much as they reflect film quality. (Note: the 2020s bar covers only films released through 2020, hence its small size.)

Finding 3: Christopher Nolan is modern cinema's most consistent director

[Figure: Horizontal bar chart of the top 10 directors by average IMDb rating, with Christopher Nolan at the top with an 8.46 average across 8 films]

Of directors with at least three films in the top 1,000, Christopher Nolan leads with an 8.46 average across eight films, well above the dataset's overall average of 7.95. Peter Jackson and Francis Ford Coppola follow, tied at 8.40. Akira Kurosawa stands out for sheer volume of acclaimed work, with 10 films on the list.

The Code

The pipeline runs in three steps. The first is cleaning the raw data: the dataset's Runtime, Released_Year, and Gross columns all come in as text and need to be coerced into proper numeric types:

import pandas as pd
import sqlite3

df = pd.read_csv('imdb_top_1000.csv')

# Strip the " min" suffix (literal match, not a regex) and convert to integer
df['Runtime'] = df['Runtime'].str.replace(' min', '', regex=False).astype(int)

# Coerce year to numeric (blank or non-numeric entries become NaN)
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')

# Strip thousands separators from gross figures, then coerce to numeric
df['Gross'] = df['Gross'].str.replace(',', '', regex=False)
df['Gross'] = pd.to_numeric(df['Gross'], errors='coerce')

# Save to SQLite for querying
conn = sqlite3.connect('movies.db')
df.to_sql('movies', conn, if_exists='replace', index=False)
conn.close()
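
Because `errors='coerce'` silently turns unparseable text into NaN, it can be worth counting how many values were lost in the process. The following is a minimal sketch of such a check, not part of the original pipeline: the sample rows are made up to mimic the raw CSV's text columns, and the helpers `to_number` and `coercion_losses` are hypothetical names.

```python
import pandas as pd

# Made-up sample rows mimicking the raw text columns (not real dataset values)
raw = pd.DataFrame({
    "Released_Year": ["1994", "PG", "1957"],
    "Gross": ["28,341,469", None, "4,360,000"],
})


def to_number(series: pd.Series) -> pd.Series:
    """Drop thousands separators and coerce text to numbers (NaN on failure)."""
    return pd.to_numeric(series.str.replace(",", "", regex=False), errors="coerce")


def coercion_losses(before: pd.Series, after: pd.Series) -> int:
    """Count values that were present as text but became NaN after coercion."""
    return int((before.notna() & after.isna()).sum())


year = to_number(raw["Released_Year"])
gross = to_number(raw["Gross"])

year_losses = coercion_losses(raw["Released_Year"], year)
gross_losses = coercion_losses(raw["Gross"], gross)
print(year_losses, gross_losses)  # 1 0
```

A nonzero count points at rows to inspect by hand before trusting any per-decade aggregate; a value that was already missing in the source (like the empty Gross entry here) is not counted as a loss.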

Then the SQL, which buckets films into decades and aggregates ratings, runtime, and counts:

SELECT (CAST(Released_Year AS INTEGER) / 10) * 10 AS decade,
       COUNT(*) AS movie_count,
       ROUND(AVG(IMDB_Rating), 2) AS avg_rating,
       ROUND(AVG(Runtime), 0) AS avg_runtime_min
FROM movies
WHERE Released_Year IS NOT NULL
GROUP BY decade
ORDER BY decade;
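
The decade bucketing relies on SQLite's integer division: once the year is cast to INTEGER, dividing by 10 drops the last digit, and multiplying by 10 restores the decade (1994 becomes 1990). The sketch below runs the same query from Python against an in-memory database. The table contents are made-up sample rows, so the printed numbers are illustrative only.

```python
import sqlite3

DECADE_QUERY = """
SELECT (CAST(Released_Year AS INTEGER) / 10) * 10 AS decade,
       COUNT(*) AS movie_count,
       ROUND(AVG(IMDB_Rating), 2) AS avg_rating,
       ROUND(AVG(Runtime), 0) AS avg_runtime_min
FROM movies
WHERE Released_Year IS NOT NULL
GROUP BY decade
ORDER BY decade;
"""

# In-memory database with made-up rows (year, rating, runtime in minutes).
# Released_Year is REAL here because a pandas column containing NaN is a float column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (Released_Year REAL, IMDB_Rating REAL, Runtime INTEGER)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (1994.0, 9.0, 142),
        (1999.0, 8.5, 138),
        (2008.0, 9.0, 152),
        (None, 7.5, 140),  # missing year: excluded by the WHERE clause
    ],
)

decade_rows = conn.execute(DECADE_QUERY).fetchall()
conn.close()

for row in decade_rows:
    print(row)
# (1990, 2, 8.75, 140.0)
# (2000, 1, 9.0, 152.0)
```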

And for the top directors finding, a HAVING clause filters out directors with fewer than three films, so the list reflects consistency across a body of work rather than one-off masterpieces:

SELECT Director,
       COUNT(*) AS movie_count,
       ROUND(AVG(IMDB_Rating), 2) AS avg_rating
FROM movies
GROUP BY Director
HAVING COUNT(*) >= 3
ORDER BY avg_rating DESC
LIMIT 10;
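
To see what the HAVING clause does, the same query can be run on a handful of made-up rows. In this sketch the director names and ratings are placeholders, not dataset values: "Director B" has the single highest-rated film but only one entry, so the three-film threshold removes it from the ranking.

```python
import sqlite3

DIRECTOR_QUERY = """
SELECT Director,
       COUNT(*) AS movie_count,
       ROUND(AVG(IMDB_Rating), 2) AS avg_rating
FROM movies
GROUP BY Director
HAVING COUNT(*) >= 3
ORDER BY avg_rating DESC
LIMIT 10;
"""

# Made-up sample rows: (director, rating)
sample = [
    ("Director A", 8.0), ("Director A", 8.5), ("Director A", 9.0),
    ("Director B", 9.5),  # one-off: filtered out by HAVING
    ("Director C", 8.0), ("Director C", 8.0), ("Director C", 8.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Director TEXT, IMDB_Rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?)", sample)

director_rows = conn.execute(DIRECTOR_QUERY).fetchall()
conn.close()

for row in director_rows:
    print(row)
# ('Director A', 3, 8.5)
# ('Director C', 3, 8.0)
```

HAVING is needed here rather than WHERE because the filter applies to the per-director count, which only exists after grouping.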

Takeaways

This was a small project, but it touched every stage of a real data workflow: ingestion, cleaning, storage, querying, visualization, and communication. The most valuable lesson wasn't technical: the data has to be verified before it is analyzed. Spotting the recency bias before drawing conclusions about a "decline in cinema quality" is the difference between a real insight and a misleading one.

Code and data available on GitHub.

More Coming Soon