Project Description

Popular Words

MAT 259, 2021

Richard Jiang

Concept

The use and popularity of certain words change over time as culture evolves. In this project, I wanted to explore this evolution using data from the Seattle Public Library (SPL). One could map these changes by looking at the popularity of words in titles as a function of publication date, but that approach does not capture the audience's direct response to which words are most attractive, and it makes poor use of the SPL checkout data. The topic I settled on was to visualize, for each year, the number of checkouts of titles containing each word. Initially, the hope was to see how some words naturally lose popularity over the years.

The data is mapped into 3D in the following way:

1. Each word is given a 2D coordinate using Word2Vec and a dimensionality reduction technique called UMAP. This algorithmically produces clusters in which similar words sit close to each other.
2. The third coordinate depends on the relative popularity of the word in a particular year.
3. Each year is located within its own reference frame.
4. The absolute frequency of the word within the entire 'active' corpus is encoded by color.

The interactive component allows the user to:

1. Specify the word or words to track over the years, in which case a line connects the word's locations through time (for the years in which it exists).
2. Select a window of popularity to view (e.g. top 100, 20-50, or 100-200), which rescales all of the colors and computations to that particular corpus.
3. Select color gradients and design elements.

While it would be great to represent every single word at the same time, and this is technically possible, it not only becomes an incredible computational burden due to the number of words to render but also adds a lot of noise, making the visualization difficult to interpret and explore.
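As a rough illustration of the popularity window, the following Python sketch ranks the words of a single year, keeps only the requested rank range, and derives a height and a color value from the words inside that window. The function name `window_layout`, the min-max scaling for height, and the share-of-window value for color are all assumptions made for this example; the actual visualization may compute these differently.

```python
from typing import Dict, List, Tuple


def window_layout(
    counts: Dict[str, int], lo: int, hi: int
) -> List[Tuple[str, float, float]]:
    """Return (word, height, color_value) for ranks lo..hi (1-based, inclusive).

    height: checkouts min-max scaled within the window (0.0 to 1.0).
    color_value: the word's share of all checkouts inside the window.
    Both formulas are illustrative assumptions, not the project's exact ones.
    """
    # Rank words by checkouts, most popular first; ties break alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    window = ranked[lo - 1:hi]
    if not window:
        return []
    values = [c for _, c in window]
    low, high = min(values), max(values)
    span = (high - low) or 1      # avoid division by zero when all counts match
    total = sum(values) or 1      # avoid division by zero for all-zero counts
    return [(w, (c - low) / span, c / total) for w, c in window]


# Made-up counts for one year, purely for illustration.
year_counts = {"love": 900, "war": 600, "girl": 500, "secret": 300, "moon": 100}
layout = window_layout(year_counts, lo=2, hi=4)
for word, height, color_value in layout:
    print(f"{word:8s} height={height:.2f} color={color_value:.2f}")
```

Because every value is rescaled to the selected window, switching from the top 100 to ranks 100-200 would stretch the less popular words across the full height and color range instead of flattening them near zero.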

Query

The query is relatively simple, but its results require a few post-processing steps, which will be attached to this entry.


-- Count checkouts per title per year, for English-language books only.
SELECT
    YEAR(s.cout) AS year, s.title, COUNT(*) AS checkouts
FROM
    spl_2016.outraw s
WHERE
    -- Books only, with a non-empty title and no comics.
    s.itemtype LIKE '%bk'
    AND s.title != ''
    AND s.collcode NOT LIKE '%comic'
    -- Exclude foreign-language collections by call number prefix.
    AND s.callNumber NOT LIKE 'CHINESE%'
    AND s.callNumber NOT LIKE 'JAPANESE%'
    AND s.callNumber NOT LIKE 'SPANISH%'
    AND s.callNumber NOT LIKE 'KOREAN%'
    AND s.callNumber NOT LIKE 'VIETNAM%'
    AND s.callNumber NOT LIKE 'FRENCH%'
    AND s.callNumber NOT LIKE 'GERMAN%'
    AND s.callNumber NOT LIKE 'ITALIAN%'
    AND s.callNumber NOT LIKE 'RUSSIAN%'
    AND s.callNumber NOT LIKE 'ARABIC%'
    AND s.callNumber NOT LIKE 'SWEDISH%'
    AND s.callNumber NOT LIKE 'PORTUGU%'
    -- Keep checkouts from 13 years ago up to 1 year ago.
    AND s.cout > NOW() - INTERVAL 13 YEAR
    AND s.cout < NOW() - INTERVAL 1 YEAR
GROUP BY YEAR(s.cout), s.title;

The full result is quite large (approximately 2.9M rows), so a reduced dataset will be attached. The query itself runs relatively quickly (under 15 minutes).
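Since the query returns checkouts per title rather than per word, one post-processing step presumably splits each title into words and sums the checkouts per word and year. The Python sketch below shows one way this could look. The function name `word_checkouts`, the tokenization rule, the stop-word list, and the choice to count a word once per title are all assumptions for illustration and may not match the attached post-processing code.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical stop-word list; the real project may filter differently or not at all.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in"}


def word_checkouts(
    rows: Iterable[Tuple[int, str, int]]
) -> Dict[int, Counter]:
    """Aggregate (year, title, checkouts) rows into per-year word totals."""
    per_year: Dict[int, Counter] = {}
    for year, title, checkouts in rows:
        # Lowercase and keep alphabetic tokens (an apostrophe is allowed inside a word).
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", title.lower())
        # Count each word once per title so repeats inside a title do not double count.
        for word in set(tokens) - STOP_WORDS:
            per_year.setdefault(year, Counter())[word] += checkouts
    return per_year


# Made-up rows in the shape returned by the query: (year, title, checkouts).
rows = [
    (2015, "The Girl on the Train", 120),
    (2015, "Gone Girl", 80),
    (2016, "The Girl on the Train", 60),
]
totals = word_checkouts(rows)
print(totals[2015].most_common(1))
```

Each per-year `Counter` can then be ranked to find the most popular words of that year.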

Preliminary sketches

Process

Final result

Code

Source Code + Data