Learn about how we made Piazza Party

We'll go over what technologies we used and the pipeline.

This was a pretty neat project. We touched a lot of technologies and learned a lot of new things. As developers, we were super appreciative of all the helpful posts, open-source code, free APIs, and *good* documentation out there. We decided to do the same for our project!

Acquiring Piazza Posts

You might have already guessed it, but the answer is web scraping, and a lot of it! We wrote a Python script using Beautiful Soup and PhantomJS. Before switching to PhantomJS, we used Selenium, since it runs your script in a visible browser window where you can clearly see what is going on. We recommend getting your script working with Selenium first and then switching to PhantomJS. A few helpful links: https://www.youtube.com/watch?v=XQgXKtPSzUI&t=161s, http://stackoverflow.com/questions/8255929/running-webdriver-chrome-with-selenium. *Make sure to close things properly* - our server stopped working because there were too many hanging processes left behind by web-scraping scripts that crashed or failed. You can always confirm whether you’ve closed things properly by running `ps -ef | grep phantom` and killing any leftover process with `kill -9 <process id>`.
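The "close things properly" advice can be captured with a small context manager. This is a sketch, not our exact code: `driver_factory` stands in for whatever constructor you use (e.g. `webdriver.PhantomJS` from Selenium, shown only in the commented usage since it assumes Selenium is installed).

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(driver_factory):
    """Open a headless browser and guarantee it gets closed,
    even if the scraping code inside the block crashes."""
    driver = driver_factory()  # e.g. webdriver.PhantomJS (assumption)
    try:
        yield driver
    finally:
        driver.quit()  # no more hanging phantomjs processes

# Usage sketch (assumes selenium + bs4 are installed; URL is a placeholder):
# with managed_driver(webdriver.PhantomJS) as driver:
#     driver.get("https://piazza.com/class/...")
#     soup = BeautifulSoup(driver.page_source, "html.parser")
```

With this pattern, even a script that dies mid-scrape still calls `quit()`, so nothing is left for `ps -ef | grep phantom` to find.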

We originally thought we would automate this script every 24 hours, but for one class with more than 1000 posts, the script would often take about an hour to run - and imagine scraping data for 10 classes! We resolved this by initially populating our database with all the historical data, then writing a second script that grabs only the “pinned”, “this week”, and “last week” posts from each course on Piazza. The script that populates our database reads in these newly generated files and adds only posts it hasn’t seen before. We wrote a cron job that runs these scripts and emails the team whether the job succeeded or failed.

Chrome Extension

*We heavily relied on Chrome’s documentation*

The first thing we did was build a very simple Chrome extension following this tutorial. The next step was looking into “page actions,” which restrict the extension to specific pages (in our case piazza.com). After that, we needed to figure out how to extract text as a user types into the “title” and “question” input boxes and send it to our backend for processing. To interact with the Piazza webpage, we used a content script, which lets the developer inject JavaScript and CSS into the page - this is how we got access to the elements and extracted the text being typed in real time, and also how we displayed the sidebar on the Piazza website. We made use of Chrome’s messaging capabilities and POST requests to relay information to and from the back end. From there, we used our content scripts to display results and add additional features to our sidebar.
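The pieces above mostly live in the extension's manifest. A sketch of what ours might have looked like (file names and permissions are illustrative, not our exact manifest; this uses the manifest v2 format that was current at the time):

```json
{
  "manifest_version": 2,
  "name": "Piazza Party",
  "version": "1.0",
  "background": { "scripts": ["background.js"], "persistent": false },
  "page_action": { "default_title": "Piazza Party" },
  "content_scripts": [
    {
      "matches": ["https://piazza.com/*"],
      "js": ["content.js"],
      "css": ["sidebar.css"]
    }
  ]
}
```

The `content_scripts` entry is what injects the JavaScript and CSS into piazza.com pages, and `page_action` is what keeps the extension scoped to those pages.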

Algorithms & APIs

We first generated a corpus by scraping several Piazza pages - about 400k words in total! We then calculated the frequency of each word and stored the counts in our database. As we populated the database with Piazza posts, we flagged words as “key words” when their inverse term frequency cleared a threshold. Taking the inverse term frequency diminishes the weight of terms that occur very frequently (“a”, “the”, “to”, etc.) and increases the weight of words that occur rarely. We then indexed each post under the key words it contains.
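As a rough sketch of the keyword test (the threshold value here is illustrative, not the one we actually tuned):

```python
import math

def keyword_set(word_counts, total_words, threshold=8.0):
    """Flag words whose inverse term frequency log(N / count) clears
    a threshold. Very common words like "a" and "the" score low and
    fall out; rare, topic-specific words score high and survive."""
    return {word for word, count in word_counts.items()
            if math.log(total_words / count) > threshold}
```

For a 400k-word corpus, a word seen 20,000 times scores log(20) ≈ 3.0 and is dropped, while a word seen only a handful of times scores well above the cutoff.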

As a user types, we process each word and also look at each word’s related forms - e.g. “run” vs. “running” vs. “ran”. With this set of words, we query our database for all questions that correspond to each word. For the returned list of questions, we used a bag-of-words model to measure how similar each stored question is to the question being asked: we take the intersection of the two word sets and divide its size by the total number of words in the stored question. We improved this analysis by also taking synonyms into consideration - e.g. “sad”, “upset”, “disappointed”. We gave preference to questions whose titles had a higher bag-of-words score, since we observed that Piazza users often concisely put the most important words related to their question in the title and then elaborate in the question body. We broke ties between similarly scored questions by how recently each was asked. Wolfram Alpha had some cool and free APIs for getting synonyms and related words.
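The scoring and ranking described above can be sketched in a few lines. The candidate tuple shape `(title_words, body_words, timestamp)` is our illustration of the idea, not our exact schema:

```python
def bow_score(candidate_words, query_words):
    """Fraction of the stored question's words that also appear
    in the words the user is typing (bag-of-words overlap)."""
    if not candidate_words:
        return 0.0
    return len(candidate_words & query_words) / len(candidate_words)

def rank_questions(candidates, query_words):
    """candidates: list of (title_words, body_words, timestamp) tuples
    (shape is an assumption). Title overlap dominates, body overlap
    comes next, and recency breaks the remaining ties."""
    return sorted(
        candidates,
        key=lambda c: (bow_score(c[0], query_words),
                       bow_score(c[1], query_words),
                       c[2]),
        reverse=True,
    )
```

Because Python sorts tuples lexicographically, a candidate only falls back to body overlap (and then recency) when title scores are equal - exactly the title-first preference described above.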

Data Visualization Page

The data visualization page was built with Bootstrap. We used a word cloud API to represent the top 50 keywords within a given date range and for a particular course. The API took care of coloring and sizing words based on their frequencies. We also display the top 10 repeated questions by recording which questions users click while using the tool.

Environment

We used Amazon Web Services’ RDS for our database. We set up a MySQL instance on the AWS free tier and connected to it from our Java application using a JDBC driver connection. We populated the database with the information produced by the web scraping. To deploy our application, we used an AWS EC2 instance to host our code on an Ubuntu virtual machine. There is a lot of documentation for AWS, but this article was particularly helpful. We then used NGINX as a reverse proxy in front of our program running in the cloud.
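A reverse-proxy setup like ours can be as small as the NGINX config sketch below. The domain and port are placeholders, not our actual values - this assumes the Java app listens locally on 8080:

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder domain

    location / {
        # Forward incoming requests to the Java app (port is an assumption)
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

NGINX terminates the public connection on port 80 and relays requests to the application process, which never has to be exposed directly.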