Faculty Research Request: Data Scraping and Analysis


Introduction

This project originates with a FIMS faculty member who is interested in scraping and analyzing data from two social media sources. Specifically, they would like to collect data from the subreddit r/librarians and from Twitter posts that use the hashtag #librarytwitter. The goal is to engage in digital scholarship and analyze the data with the intent of gaining insight into trends on library-related social media channels.


Through this collaboration, we aim to create a strategy for data collection, organization, and analysis using open-source software and programming tools. The library resources and staff allocated to this project can provide expertise that enables the research goals, support access to software and technical know-how, and deliver services aimed at preserving the data and making it accessible for future research. 


In this proposal, we will discuss the project objectives, the necessary technology and infrastructure, resourcing, project sustainability, and parameters for project completion. Additionally, following a conversation with colleagues, we will address concerns about ethical data collection and use. 


For digital scholarship to be successful, we need to define what it is and how it can positively affect our academic communities' research. A variety of characteristics need to be considered (Sherman Centre for Digital Scholarship).


Objectives

The library’s role and objectives in this project are to support and advance research through a partnership between the library and faculty. We can do that by filling gaps in the faculty member's and their team’s technical knowledge of digital scholarship, and by empowering them through training in the use of digital tools. 


We aim to:

● Create a strategy for data collection, organization, and analysis using open-source tools
● Fill gaps in the research team’s technical knowledge through training in digital tools
● Preserve the project’s data and make it accessible for future research


Infrastructure

The technical components of this project can be split into three sections: data scraping, data analysis, and preservation. Each requires unique tools, methods, and software. 

*Examples are available in the Appendix. 


● Data scraping 

To scrape specific data from Reddit and Twitter, we will use Python. Python is a highly flexible programming language that emphasizes readability and offers a comprehensive ecosystem of packages. We will access Python via Google Colab, a browser-based environment for writing and executing Python code. This makes it easy to share, collaborate on, customize, and save projects for reuse. 
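As a minimal sketch of the Colab setup, the first cell of the notebook can install the third-party packages used in the examples below (the exact packages are discussed in the scraping sections; pandas is preinstalled in Colab but is listed for completeness):

# Run once in a Colab cell to install the scraping packages used below
!pip install praw pandas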


● Reddit- data scraping

To harvest data from Reddit we will import the following Python packages: praw, which connects to the Reddit API, and pandas, which structures and exports the extracted data.


Basically, we will create a Reddit account for the project, connect to it via the “praw” package, and extract specifically defined data using “pandas”. Using “praw” we can define a subreddit to scrape, change the number of posts that are scraped (for example, 50 or 500), and choose the listing by which they are ordered, such as “top”, “controversial”, “new”, “hot”, or “gilded”. A sketch of this workflow follows.
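The following is a minimal sketch, assuming hypothetical API credentials (the client_id, client_secret, and user_agent values are placeholders that would come from registering the project's Reddit account as a script application):

import praw
import pandas as pd

# Placeholder credentials from the project's registered Reddit app
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="librarians-research-scrape",
)

# Collect the top 500 posts from r/librarians
rows = []
for post in reddit.subreddit("librarians").top(limit=500):
    rows.append({
        "id": post.id,
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
        "selftext": post.selftext,
    })

# Structure the results with pandas and save them for later analysis
df = pd.DataFrame(rows)
df.to_csv("librarians_top_500.csv", index=False)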


● Twitter- data scraping

Twitter data will be extracted similarly: a package that connects to Twitter will collect the posts, and the results will then be formatted with pandas and exported to CSV.
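The proposal does not fix the Twitter package; as one possibility, the sketch below assumes the open-source snscrape library, which at the time of writing could search tweets by hashtag and date range without API keys (attribute names vary between snscrape versions, and access to Twitter data is subject to change):

import pandas as pd
import snscrape.modules.twitter as sntwitter

# Search the hashtag within the example date range given in the Appendix
query = "#librarytwitter since:2022-12-01 until:2022-12-07"

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 500:  # cap the number of tweets collected
        break
    rows.append({
        "date": tweet.date,
        "user": tweet.user.username,
        # newer snscrape versions expose rawContent, older ones content
        "text": getattr(tweet, "rawContent", getattr(tweet, "content", "")),
    })

df = pd.DataFrame(rows)
df.to_csv("librarytwitter_dec2022.csv", index=False)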


● Voyant- analysis

Voyant is a web-based tool for performing text analysis. Users upload a text corpus, and Voyant performs tasks such as identifying phrases, term frequencies, and occurrences, and visualizing the data. This will open new avenues for interpretation and make analysis and data organization more efficient than conventional research methods. Because of their technical knowledge and training, librarians are increasingly engaged with scholars in teaching digital tools and in the access and preparation of datasets (Wallace & Feeney, 2018, p. 24). Using tools such as Voyant meets the library's objectives of inventiveness, interdisciplinary exploration, and collaboration.


For a list of tools and guides, visit the Voyant Tools Help page.


● Institutional repository 

The university’s institutional repository is available to support the preservation of any data created through this project. All data should meet the accepted ethical benchmarks of the scholarly community. The university covers all costs associated with this service. More details about the sustainability and use of this tool can be found in subsequent sections. 


Resourcing

Resources for the web scraping, analysis, and sustainability of this project will be either open-source or supported through institutional operating costs. 


Maintenance and Sustainability

The maintenance and sustainability requirements of this project are minimal. One thing to consider is the ability to replicate scrapes using Python. Unlike software that may become incompatible or unsupported over time, the Python code we use for scraping data should, in principle, remain replicable. This provides the opportunity to produce scheduled (weekly, monthly) scrapes or follow-up data sets further into the future, as sketched below. 
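As an illustration of that replicability, the Reddit scrape could be wrapped in a function and re-run on a schedule, writing each run to a date-stamped file (the function and file names here are illustrative assumptions, not part of the original proposal):

from datetime import date

import pandas as pd
import praw

def scrape_top_posts(reddit, subreddit_name="librarians", limit=500):
    # Collect the current top posts so the same scrape can be repeated later
    rows = [
        {"id": p.id, "title": p.title, "score": p.score, "created_utc": p.created_utc}
        for p in reddit.subreddit(subreddit_name).top(limit=limit)
    ]
    return pd.DataFrame(rows)

# Re-running this weekly or monthly builds a series of follow-up data sets
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
# df = scrape_top_posts(reddit)
# df.to_csv(f"librarians_top_{date.today().isoformat()}.csv", index=False)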

The other thing to consider is how the data for this project is preserved. Since the faculty member will be encouraged to upload their project data to the institutional repository, there is no need for this individual project to be maintained separately; the data will be secured, preserved, and made accessible as part of the university library's larger collection.  


Ethical/Legal Concerns

Due to the public, online nature of the data utilized in this project, some ethical concerns will invariably be raised during its collection, analysis, or presentation. While precedent has been set by the many research projects built around social media posts and comments, it should be our goal to identify any possible ethical pitfalls in this project. These concerns include providing anonymity for users whose data is collected, the ethics of sharing the project's collected data, and the blurred lines of consent. 


Below are three articles that discuss the ethical challenges of projects that use social media data. It is suggested that we review this literature to ensure the project is, at minimum, in line with current research trends involving the ethics of using social media data. 


  Norman Adams, N. (2022). “Scraping” Reddit posts for academic research? Addressing some blurred lines of consent in growing internet-based research trend during the time of Covid-19. International Journal of Social Research Methodology, 1–16. https://doi.org/10.1080/13645579.2022.2111816 


  Proferes, N., Jones, N., Gilbert, S., Fiesler, C., & Zimmer, M. (2021). Studying Reddit: A Systematic Overview of Disciplines, Approaches, Methods, and Ethics. Social Media + Society, 7(2), 1-14. https://doi.org/10.1177/20563051211019004 


  Reagle, J. (2022). Disguising Reddit sources and the efficacy of ethical research. Ethics and Information Technology, 24(3), 41–41. https://doi.org/10.1007/s10676-022-09663-w 


Sun-setting and Project Completion

A key way to ensure that the project is sunsetted by mutual agreement is for each party to have defined roles and boundaries. This will help the project run more efficiently and enable each party to meet both their own goals and the overall project’s goals. 



Finally, once the project is completed, it is vital to hold a post-mortem so that all parties involved can look at the project from start to finish and identify what went right and what could be improved. Grounded in thoughtful reflection and discussion, this is an important and often overlooked stage that will benefit future projects.


Appendix

● Python Example 1: Reddit. See the code here for an example scrape. This instance scrapes data from the top 500 posts on r/librarians. 


● Python Example 2: Twitter. See the code here for an example scrape. This instance scrapes data using the hashtag #librarytwitter between 1 December 2022 and 7 December 2022.


● Voyant Example: This screencap shows some of the versatility of Voyant. With multiple views, the user can explore ideas and keywords through visuals and text at the same time. In this example, the focus is on Reddit users + the keyword “advice”. Voyant highlights key terms in the corpus, allows the user to focus on the context of a term(s), and visualizes language use. 

References


Sherman Centre for Digital Scholarship. (n.d.). What is Digital Scholarship? https://scds.ca/what-is-ds/  


Wallace, N., & Feeney, M. (2018). An Introduction to Text Mining: How Libraries Can Support Digital Scholars. Qualitative & Quantitative Methods in Libraries, 7(1), 23–30.

Project Presentation

Web Scrape & Analysis Project