MAT 259, 2016
In this assignment, I wanted to explore the relationship between historical weather data for Seattle and checkout information from the Library. In doing so, I wanted to ask whether there is a correlation between the semantic meaning of a word and the average temperature at the times it appears in checkouts.
I started by acquiring historical weather data for Seattle, downloaded from the National Oceanic and Atmospheric Administration (NOAA)'s Surface Data Hourly Global dataset. The dataset contains entries starting from April 2005 and extends up to today. After getting rid of unnecessary columns, I ended up with the following dataset:
DATE: Date of current item,
TIME: Time of current item,
DIR: Wind direction (0,360), 0 is North,
SPD: Speed of the wind in mph,
TEMPF: Temperature in F,
TEMPC: Temperature in C,
SLP: Sea Level Pressure
The data can be found in the data folder of the sketch as weather_05_17.csv.
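As a rough illustration of how such rows could be reduced to one mean temperature per six-hour window (the granularity the library query further down groups by), here is a minimal Python sketch. It assumes DATE is formatted as YYYYMMDD and TIME as HHMM, which this write-up does not state, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def six_hour_bucket(date_str, time_str):
    """Return (date, bucket) where bucket is 0..3 for hours 0-5, 6-11, 12-17, 18-23.

    Assumption: TIME looks like 'HHMM' (possibly without a leading zero).
    """
    hour = int(time_str.zfill(4)[:2])
    return (date_str, hour // 6)


def mean_temp_by_bucket(csv_text):
    """Average the TEMPF column over each (date, six-hour bucket) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, row count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = six_hour_bucket(row["DATE"], row["TIME"])
        sums[key][0] += float(row["TEMPF"])
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# Made-up rows purely for illustration, not taken from weather_05_17.csv.
sample = "DATE,TIME,TEMPF\n20050419,0053,50\n20050419,0553,52\n20050419,0653,60\n"
means = mean_temp_by_bucket(sample)
```

Keying on the date plus the hour divided by six keeps the weather rows aligned with any other data grouped the same way.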
After that, I retrieved the checked-out items from the library and concatenated their Subject fields for each 6-hour period. The querying took 5 hours in total. Since a single query covering 2005 to 2017 times out, I had to query in 6-month intervals and stitch the resulting CSV files together. The query is as follows:
SELECT … SEPARATOR '\n') 'subjects'
FROM (SELECT s.bibNumber, s.subject, t.checkOut
FROM … t
JOIN spl_2016.subject s ON s.bibNumber = t.bibNumber
WHERE t.checkOut >= '2005-04-19'
AND t.checkOut <= '2017-01-01') AS sth
GROUP BY DATE(sth.checkOut), FLOOR(HOUR(sth.checkOut) / 6)
The query result can be found in the data folder of the sketch as words_05_17.csv.
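The splitting and stitching described above could be sketched in Python roughly as follows. This is not the script actually used; the helper names are hypothetical, and the half-open windows (start inclusive, end exclusive) are an assumption chosen so that no checkout lands in two windows.

```python
import csv
import io
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole months (day clamped to 28 to stay valid)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def six_month_windows(start, end):
    """Split [start, end) into consecutive windows of at most six months."""
    windows = []
    lo = start
    while lo < end:
        hi = min(add_months(lo, 6), end)
        windows.append((lo, hi))
        lo = hi
    return windows


def stitch_csv(parts):
    """Concatenate CSV texts that share a header row, keeping the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, text in enumerate(parts):
        rows = list(csv.reader(io.StringIO(text)))
        writer.writerows(rows if i == 0 else rows[1:])
    return out.getvalue()


windows = six_month_windows(date(2005, 4, 19), date(2017, 1, 1))
stitched = stitch_csv(["a,b\n1,2\n", "a,b\n3,4\n"])
```

Going through the csv module instead of joining raw lines matters here because the concatenated subject field contains newlines, so one record can span several physical lines.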
I then preprocessed this file in Python to get the frequency of each word in the subject fields. While doing that, I used the stemming and lemmatization modules of the NLTK library, which reduce inflectional forms and sometimes derivationally related forms of a word to a common base form (for example, cars becomes car). This process gave me a list of words with their frequencies for the given time period. The code (getFrequencies.py) and the resulting CSV (nltk_05_17.csv) are in the data folder of the sketch.
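The counting step could look something like the sketch below. It is not the contents of getFrequencies.py. To keep the example free of external dependencies it uses a deliberately crude stand-in normalizer (naive_singular, a hypothetical helper) where the real code used NLTK, and the tokenization rule is also an assumption.

```python
import re
from collections import Counter


def naive_singular(word):
    """Crude stand-in for a real stemmer: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def word_frequencies(subject_text, normalize=naive_singular):
    """Count normalized lowercase alphabetic tokens in a block of subject text."""
    tokens = re.findall(r"[a-z]+", subject_text.lower())
    return Counter(normalize(token) for token in tokens)


# Made-up subject lines for illustration only.
freqs = word_frequencies("Cars -- Fiction\nCar racing\nWeather forecasting")
```

With NLTK installed, `nltk.stem.PorterStemmer().stem` or `nltk.stem.WordNetLemmatizer().lemmatize` (the latter needs the WordNet corpus downloaded) could be passed as `normalize` instead. Which of the two the original script relies on is not stated here.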
Idea explorations and the class diagram are below.
The class in the diagram keeps the related information about each word, and the code keeps a collection of unique keywords, updating the necessary fields whenever a duplicate occurs. Once the file has been read, I scattered the word collection in space according to their Average Date, Average Temperature and Average Frequency values. Since the frequency values are heavily concentrated toward the low end, I used a log scale for that axis.
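To make the bookkeeping concrete, here is a small Python sketch of a per-word record that is updated on duplicates, plus a log-scaled axis position. The actual sketch and its class diagram may differ. WordStats, collect and log_position are hypothetical names, and averaging unweighted over the periods in which a word appears is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class WordStats:
    word: str
    periods: int = 0       # number of time periods in which the word appeared
    total_count: int = 0   # summed frequency over those periods
    date_sum: float = 0.0  # summed day numbers, used for the average date
    temp_sum: float = 0.0  # summed temperatures, used for the average temperature

    def add(self, day_number, temp_f, count):
        """Fold one more occurrence record into the running sums."""
        self.periods += 1
        self.total_count += count
        self.date_sum += day_number
        self.temp_sum += temp_f

    @property
    def avg_date(self):
        return self.date_sum / self.periods

    @property
    def avg_temp(self):
        return self.temp_sum / self.periods

    @property
    def avg_frequency(self):
        return self.total_count / self.periods


def collect(rows):
    """Build one WordStats per unique word, updating it when the word repeats."""
    words = {}
    for word, day_number, temp_f, count in rows:
        if word not in words:
            words[word] = WordStats(word)
        words[word].add(day_number, temp_f, count)
    return words


def log_position(value, max_value):
    """Map a value in [1, max_value] to [0, 1] on a base-10 log scale."""
    if max_value <= 1:
        return 0.0
    return math.log10(value) / math.log10(max_value)


# Made-up rows: (word, day number, temperature in F, frequency in that period).
words = collect([("snow", 100, 40.0, 2), ("snow", 200, 50.0, 4), ("sun", 150, 70.0, 1)])
```

On the log axis a frequency of 10 sits halfway between 1 and 100, which spreads out the crowded low-frequency end.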
Whenever the user starts typing on the keyboard, the system searches for the phrase among the keywords (regular expressions are allowed). The search results are listed on the right-hand side of the GUI, and the first word of the list is detailed on the left side. Search results are also highlighted in the word cloud. The Time Barcode shape in the detail panel denotes the dates on which the current word was used, and the red line marks its Average Date. The time barcode starts at the beginning of the data on its left (April 2005) and runs up to January 2017.
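The keyword search can be illustrated with a few lines of Python. This is only a sketch of the idea, not the sketch's own search code. Case-insensitive matching and falling back to a literal match when the typed phrase is not yet a valid pattern are both assumptions.

```python
import re


def search(keywords, phrase):
    """Return the keywords matching the phrase, treated as a regular expression."""
    try:
        pattern = re.compile(phrase, re.IGNORECASE)
    except re.error:
        # A half-typed pattern such as "(" is invalid, so match it literally.
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return [word for word in keywords if pattern.search(word)]


keywords = ["hot", "hotel", "shot", "cold"]
starts_with_hot = search(keywords, "^hot")
ends_with_hot = search(keywords, "hot$")
```

The fallback is useful for search-as-you-type, where the pattern is often incomplete while the user is still typing.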
In addition to these, it is possible to filter what's drawn inside the cube with sliders at the bottom-left.
The left-most buttons focus the camera on a specific axis. The Invert Colors option changes the color scheme from dark to light. Auto Rotation rotates the camera automatically, and the Color Search option colors the words matching the current search term according to their average temperature values.
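One simple way to turn an average temperature into a color is linear interpolation between two endpoint colors, sketched below in Python. The blue-to-red endpoints and the function name are assumptions, since the palette actually used by Color Search is not described here.

```python
def temperature_color(temp_f, t_min, t_max, cold=(0, 0, 255), warm=(255, 0, 0)):
    """Blend from the cold color to the warm color as temp_f goes from t_min to t_max."""
    if t_max <= t_min:
        return cold
    # Clamp the normalized temperature to [0, 1] so outliers keep an endpoint color.
    t = min(1.0, max(0.0, (temp_f - t_min) / (t_max - t_min)))
    return tuple(round(c + (w - c) * t) for c, w in zip(cold, warm))


coldest = temperature_color(40.0, 40.0, 60.0)
warmest = temperature_color(60.0, 40.0, 60.0)
```

Passing the minimum and maximum of the matched words, instead of the whole dataset, would stretch the palette over the search results only.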
After using the visualization for a couple of minutes, I discovered that it shows a pattern in the low-frequency items. Since these items appear only once in the dataset, their average temperature is simply the temperature at the time they were used. As a result, the Date-Temperature plane clearly shows seasonality (winter-summer differences in temperature).
Examining words like hot also yields interesting figures.
||52.02F -- 11.12C
||49.62F -- 9.79C
||52.35F -- 11.31C
||49.99F -- 9.99C
||51.02F -- 10.57C
||50.78F -- 10.43C