Checkout Co-occurrency
MAT 259, 2023
Lu Yang

Concept
I wanted to look at pairs of bibliographical items that were borrowed together and measurement that can be established based on this concurrency. I believe this is interesting as it will review circulation history and correlate items even though they have irrelevant metadata. From an agent-based modelling perspective, I’m more interested in creating unidirectional correlations and measure concurrence frequency based on Dewey class and subjects.

Query
The Approach
In order to find items that were borrowed together, I assumed that items that had the same checkout and checkin timestamp were likely to be checked out by the same person. It should be noted that it is possible that
1) two individuals checked in/out books at the exact same time
2) librarians hold books at a rotation interval, so batches of books checked in/out by different person at different time may have same timestamp
3) processing time by the same librarian can lead to a few minutes difference on the timestamp
4) someone checked out multiple books but did not return them all at the same time

The Query
In order to perform the search, I looked at the cartesian product of the inraw table with itself and selected bibliographical items that had the same checkout and checkin times as pairs. To measure concurrence frequency, I also looked at 3-digit Dewey class, which contains 1) main class; 2) division; 3) section; and subjects, which are retrieved as a concatenated string from the subject table.
SELECT


t1.bibNumber AS A_bib,
t1.title as A_title,
FLOOR(t1.deweyClass) AS A_dewey,
t3.subject AS A_subject,

t2.bibNumber AS B_bib,
t2.title as B_title,
FLOOR(t2.deweyClass) AS B_dewey,
t4.subject AS B_subject

FROM
(SELECT
bibNumber,
GROUP_CONCAT(subject
SEPARATOR ';') AS subject
FROM
spl_2016.subject
WHERE
subject REGEXP '^[0-9a-zA-Z .]+$'
-- filter out non-English subjects
-- The anchor ^ and $ ensure that you are matching the entire string and not part of it.
-- Next the character class [0-9a-zA-Z .] matches a single upper/lower case letter or a space or a period.
-- The + is the quantifier for one or more repetitions of the previous sub-regex.
-- so in this case it allows us to match one or more of either a period or a space or a upper/lower case letter.
GROUP BY bibNumber) AS t3,

(SELECT
bibNumber,
GROUP_CONCAT(subject
SEPARATOR ';') AS subject
FROM
spl_2016.subject
WHERE
subject REGEXP '^[0-9a-zA-Z .]+$'
-- filter out non-English subjects
-- The anchor ^ and $ ensure that you are matching the entire string and not part of it.
-- Next the character class [0-9a-zA-Z .] matches a single upper/lower case letter or a space or a period.
-- The + is the quantifier for one or more repetitions of the previous sub-regex.
-- so in this case it allows us to match one or more of either a period or a space or a upper/lower case letter.
GROUP BY bibNumber) AS t4,

spl_2016.inraw t1
INNER JOIN
spl_2016.inraw t2 ON t1.cout = t2.cout AND t1.cin = t2.cin
AND t1.title LIKE '%architecture%'
AND t1.bibNumber != t2.bibNumber
AND t1.deweyClass != ''
AND t2.deweyClass != ''
-- AND YEAR(t1.cout) > 2017

WHERE
t1.bibNumber = t3.bibNumber
AND t2.bibNumber = t4.bibNumber

Final result
Findings I have queried the entire dataset and retrieved 3,207,172 rows in 307 seconds. Attached is a sample of the data. The data shows correlation between two identical bibliographical items A and B with unique bibNumbers (copies with different format are treated as one bibliographical item). From this spreadsheet, further concurrence frequency can be counted such as A_dewey to B_dewey, A_dewey to B_subject, A_subject to B_subject.

These three frequencies are important to me because I’m interested in visualizing these correlations in an agent-based swarm simulation. The aggregation of these frequencies will determine the attraction/repulsion force from A to B, not necessarily from B to A, when two random bibliographical items meet.




Code
All work is developed within Processing
Source Code + Data