Lost and Forgotten Books
MAT 259, 2016
Hannah Wolfe

Concept
I thought it would be interesting to see what books were only checked out once, sometimes due to never being returned. Originally I wanted to look at books that were never checked out, but I could not find a book that fulfilled this criteria, due to the dataset being checkout history, not the entire library collection.


Query

SELECT 
    c, checkout, checkin, title, callNumber,
    CASE WHEN callNumber LIKE '%SCI-FIC%' THEN 1 ELSE 0 END AS sci,
    CASE WHEN callNumber LIKE '%MYSTERY%' THEN 1 ELSE 0 END AS mys,
    CASE WHEN callNumber LIKE '%WESTERN%' THEN 1 ELSE 0 END AS wes,
    CASE WHEN callNumber LIKE 'FIC%' or callNumber LIKE 'CAS FIC%' or 
              callNumber LIKE 'CD FIC%' THEN 1 ELSE 0 END AS fic,
    CASE WHEN callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or 
              callNumber LIKE 'CD YA%' THEN 1 ELSE 0 END AS YA,
    spl_2016.transactions.bibNumber, t.itemNumber
FROM
    (SELECT 
        COUNT(checkOut) AS c, spl_2016.callNumber.itemNumber
	FROM spl_2016.callNumber
	INNER JOIN spl_2016.transactions 
	      ON spl_2016.transactions.itemNumber = spl_2016.callNumber.itemNumber
	where callNumber LIKE '%SCI-FIC%' OR callNumber LIKE '%MYSTERY%' OR 
	  callNumber LIKE '%WESTERN%' or callNumber LIKE 'FIC%' or 
	  callNumber LIKE 'CAS FIC%' or callNumber LIKE 'CD FIC%'or 
	  callNumber LIKE 'YA%' or callNumber LIKE 'CAS YA%' or 
	  callNumber LIKE 'CD YA%'
	GROUP BY spl_2016.callNumber.itemNumber
	ORDER BY c) t
	
    INNER JOIN spl_2016.transactions 
        ON spl_2016.transactions.itemNumber = t.itemNumber
    LEFT JOIN spl_2016.callNumber 
        ON spl_2016.callNumber.itemNumber = spl_2016.transactions.itemNumber
    LEFT JOIN spl_2016.title 
        ON spl_2016.transactions.bibNumber = spl_2016.title.bibNumber
    LEFT JOIN spl_2016.deweyClass 
        ON spl_2016.transactions.bibNumber = spl_2016.deweyClass.bibNumber
WHERE
    c = 1 AND YEAR(checkout) > 2004 and deweyClass = ""
ORDER BY checkout; 

/*This just gets the checkouts for the items that have only been checked out once*/
/*Includes 114 YA bibliographies that we might want to get rid of */
/*runtime duration: 684.374 sec, fetch: 0.170 sec, 18757 rows */

Query/Code Design
I first looked at this by creating a sub table which counted how many times all books were checked out. I decided to ignore books with ILLM in the title because I assume they are Interlibrary loan books or something similar. I then joined the sub table with the transactions table to get the checkin and checkout date of those books and filtered it by where the checkout count was one. This was interesting but didn't seem to give me much useful information about those books, so I decided to group the results by section and type.

Again I found that the items that were most likely to be only checked out once were adult fiction, followed by juvenile fiction and dcillb (which I don't know). The Dewey books that were most likely to be checked out only once were "895 Literatures of East & Southeast Asia", "891 East Indo-European & Celtic literatures" and "811 American poetry in English". These subjects again are literature. Outside of literature in the top 10 types were "658 General management" and "629 Other branches of engineering". This data may not be relevant because I don't know how many of each type of Dewey class and item type there are, so novels might be the most to be only checked out once, because there are more novels.

I then modified the query to use item number instead of bibNumber, select only fiction, and extract genre information from the call number. I found that there are books that have been checked out repeatedly and never returned. For example there are 58 items for "Moonglow a novel" by Chabon, that have been checked out and never returned. 25 items of "books on bikes that were checked out only once. 24 items of "best American short stories and the yearbook of the American short story" that have only been checked out once. 23 copies of "sellout" have only been checked out once. There is a chance that these are errors, but according to the catalog there are 108 copies of "best american short stories", so it doesn't seem unreasonable that 1/4 of them have only been checked out once. There are 14890 distinct bibNumbers for the 18757 items. 2469 are books that have multiple items that has only been checked out once for the same bibNumber.

After the query was run I wrote a python script to extract information from the dataset. I did this because the query already took so long I didn't want to make it any more complicated and I didn't want to rerun it. Also, python has a lot of nice string manipulation functionality that mysql doesn't. I used the script to get the item type, year if available, and author's name from the call number. I also added a field for if it had been returned, and calculated checkout and checkin dates relative to 1/1/2005. A lot of this I could have done in sql but it was easier to add to the python script then wait 10+ minutes for the query to run.

I spent quite a bit of time trying to get a query that would find books that had been checked out multiple times and not returned and found that some items had been not returned many times. These inconsistencies in the data turned out to make this not a viable dataset. This does mean potentially that the books labeled as "stolen/lost" may have been returned, but either way they have only been checked out once and fit within the critera for this visualization.

I put the query, raw data, python script and final data in the zip file with the processing code.


Preliminary sketches
I normally don't really sketch but I found these notes on scratch paper by my desk. I first mapped out what type of books are potentially part of this dataset.



I originally wanted to have a line per item but I found that I had to group them by first two letters of the last name in the full view, and last name in the sub views.




Process




Author name or letter is the Y axis and time is the X axis. Circles are books checked out that were never returned and rectangles are books that were only checked out once with length representing the checkout length. I rounded the length to 3 weeks with the minimum length as 3 weeks so that the checked out items are visible. The circles diameter is also the length of 3 weeks. According to the library website, when you checkout a book you can have it for 3 weeks and renew it twice for up to 9 weeks total. The Y axis is grouped by the first two letters of the author's name in the main view, and by the author's name in the sub views. I chose the colors red, blue and yellow with an opacity of 50/255 for the views with multiple types and just black for the views with only one type of data. I chose these colors because they are very distinct, clean and mix well together. I chose yellow for the most popular type in the visualizations because it overpowers the screen less.


Interaction Design
Keyboard Interaction:
Pressing the numbers 0-9 show 10 different views of the data.
 0 charts the items in the categories of "genre", "fiction" and "young adult".
 1 charts the items by item type, "book", "CD" and "casette"
 2 charts the "genre" books in the categories "mystery", "western" and "science fiction"
 3 charts the CDs and Casettes only
 4-9 charts "books", "science fiction", "mystery", "western", "fiction" and "young adult" separately
Pressing a-z filters the chart to the authors who's books start with the typed letter. Pressing the spacebar resets it to be able to see all the books.

Mouse Interaction:
I added hover text so that the viewer can see the first two letters of the author's name, or the author's name for each row. This allows viewers to easily identify if books are by the same author.

I added a legend at the bottom of the visualization, because the color meanings change based on the view.

Final result
The first view shows books split into three categories, genre books, young adult books and fiction books.



The second view shows the data by item type, book, cd or cassette.



The third view shows the genre books categorized into science fiction, western and mystery.

Pressing a letter on the keyboard shows only books by the author who's last name starts with that letter. The mouse can be used to show hover text to find out the author's name.


Evaluation/Analysis
Books checked out for years and returned later happened a lot before 2010, but not much after. Maybe these books were returned earlier but there was an issue with the checkin return system. Most of the books that were checked out once and returned later after 2010 were checked out for less than a month. There were no books checked out and not returned before 2006. There are a lot of books that have been checked out recently and not returned, most likely because they are being read. There were a lot more books checked out only once than tapes or CDs. For the genres there were a lot more Mysteries, and science-fiction books checked out only once than westerns. This might show overall lack of popularity for westerns or there may have a smaller western collection. Also for the western books there is only one book that was checked out for over a year. Comparing books on tape to CD it looks like people stopped checking out tapes in 2009 though CDs are increasing in popularity. Certain authors and books have been checked out multiple times only once, like Salinger's "Catcher in the Rye", Jane Austen, Beatty's "Sellout", Chabon's "Moonglow" or "Best American Short Stories". Sometimes the author is not actually the author like for "Best American Short Stories" the call number has "Best" instead of the author's name. There are also a ton of copies of "Books on Bikes" and "Open Air" that have no author name, though those are the only ones labeled FICTION without an author in the call number. Also weirdly a ton of fiction books which the first letter of the authors name started with the letter B were checked out and returned in the beginning of 2013.


Code
Built with Processing 3.0.1
Source Code + Data