Listening to the Web Crawl: Making Music Out of Web Crawling Data Using SuperCollider

SuperCollider is free and open-source software that allows users to generate audio programmatically, including from messages sent over a local server. In our first project as data visualization interns, we used Python to crawl the web and collect parameters, which we passed to SuperCollider to generate music.

Within the Python portion of the project, we first supplied a starting web page (e.g. http://www.umich.edu) to begin crawling. Using the Beautiful Soup Python library, the script recorded the following data about each page it opened, beginning with the starting page, and sent that data as a message over a local server:

  • total number of links on the page
  • size of the file
  • number of divs on the page
  • change of top-level domain (between .edu, .com, or other domain name)
  • amount of time to open the link
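As a rough illustration of how these five values could be gathered from one page, here is a minimal sketch. It uses Python's standard-library html.parser as a dependency-free stand-in for Beautiful Soup, and the names PageStats, top_level_domain, and page_metrics are hypothetical, not taken from the project's source code. The load time is passed in rather than measured so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageStats(HTMLParser):
    """Collects link targets and counts divs while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.div_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors without a target
                self.links.append(href)
        elif tag == "div":
            self.div_count += 1


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'edu' for www.umich.edu."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def page_metrics(url, html, load_seconds, previous_tld=None):
    """Build the five per-page values described in the list above."""
    stats = PageStats()
    stats.feed(html)
    tld = top_level_domain(url)
    return {
        "num_links": len(stats.links),
        "file_size": len(html.encode("utf-8")),  # size in bytes
        "num_divs": stats.div_count,
        "tld_changed": previous_tld is not None and tld != previous_tld,
        "load_seconds": load_seconds,
    }


sample_html = '<div><a href="a.html">a</a><div><a href="http://x.com">x</a></div></div>'
metrics = page_metrics("http://www.umich.edu", sample_html, 0.4, previous_tld="edu")
```

In a real crawl, load_seconds would come from timing the request, and previous_tld from the page visited just before.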

After gathering data from the starting web page, the script opened the first link found on the page, extracted the same data, and sent it as a message to SuperCollider. It then did the same for the second link on the starting page, then the third, and so on. Only after exploring the entire set of links on a page would the script begin to explore the next tier of pages, thus prioritizing the breadth of links over the depth of links.
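The visiting order described above is a breadth-first traversal. The sketch below shows that order on a small in-memory link graph instead of the live web. The function name breadth_first_order, the toy site dictionary, the max_pages cap, and the skipping of already-seen pages are all assumptions made for illustration.

```python
from collections import deque


def breadth_first_order(start, get_links, max_pages=50):
    """Return pages in the order a breadth-first crawl would open them."""
    seen = {start}           # assumption: each page is opened only once
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()           # oldest discovered page first
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)       # deeper tiers wait at the back
    return order


# Toy link graph standing in for real pages.
site = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}
bfs_order = breadth_first_order("home", lambda page: site.get(page, []))
```

Every link on the starting page ("a", "b") is opened before any link found on those pages.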

To turn the web crawling data into music, a note or set of notes was played each time SuperCollider received a message from the web crawler. The note(s) were played using two digital instrument definitions (called synths) created in SuperCollider. Each synth used the G below middle C as a starting note, to give the output sounds a common tonality.
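The post does not say which protocol carried the messages. SuperCollider commonly receives Open Sound Control (OSC) packets over UDP, and its language process listens on port 57120 by default, so the sketch below assumes OSC. The address "/crawl", the argument order, and the function names are hypothetical.

```python
import socket
import struct


def osc_string(text):
    """Encode a string as OSC requires: null-terminated, padded to 4 bytes."""
    data = text.encode("ascii") + b"\x00"
    return data + b"\x00" * (-len(data) % 4)


def osc_message(address, *args):
    """Build an OSC message from int and float arguments."""
    tags = ","
    payload = b""
    for arg in args:
        if isinstance(arg, int):
            tags += "i"
            payload += struct.pack(">i", arg)   # 32-bit big-endian integer
        elif isinstance(arg, float):
            tags += "f"
            payload += struct.pack(">f", arg)   # 32-bit big-endian float
        else:
            raise TypeError("only int and float arguments are supported here")
    return osc_string(address) + osc_string(tags) + payload


def send_to_supercollider(message, host="127.0.0.1", port=57120):
    """Send one packet over UDP; the port is an assumption (sclang default)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))


# Hypothetical message: number of links, then load time in seconds.
packet = osc_message("/crawl", 3, 1.5)
```

On the SuperCollider side, a responder registered for the same address would unpack these values and trigger the synths.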

The two synths were named smooth and chord. For the smooth synth, we scaled the number-of-links parameter and matched it to an interval within a two-octave range. This interval was added to the base note of G to create a melody note within the G natural minor scale for every message. For the chord synth, SuperCollider generated a chord only when the amount of time to open the link exceeded one second. SuperCollider selected a chord with no inversion, first inversion, or second inversion depending on the number of divs on the page (using the same scaling strategy as for the melody note).
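The mapping itself lived in SuperCollider, but the arithmetic can be sketched in Python. The version below assumes MIDI note numbers (55 for the G below middle C), a linear scaling against the largest value seen so far, and a G minor triad for the chord. None of these details are confirmed by the post, and all names are hypothetical.

```python
BASE_NOTE = 55  # MIDI number for the G below middle C (middle C is 60)

# Semitone offsets of the natural minor scale, extended over two octaves.
MINOR_STEPS = [0, 2, 3, 5, 7, 8, 10]
TWO_OCTAVES = MINOR_STEPS + [step + 12 for step in MINOR_STEPS] + [24]

# Assumed G minor triad: root position, first inversion, second inversion.
INVERSIONS = [[0, 3, 7], [3, 7, 12], [7, 12, 15]]


def scale_to_index(value, max_value, size):
    """Map value in [0, max_value] linearly onto an index in [0, size - 1]."""
    if max_value <= 0:
        return 0
    fraction = min(max(value / max_value, 0.0), 1.0)
    return min(int(fraction * size), size - 1)


def melody_note(num_links, max_links):
    """Pick a G natural minor note from the link count (smooth synth)."""
    index = scale_to_index(num_links, max_links, len(TWO_OCTAVES))
    return BASE_NOTE + TWO_OCTAVES[index]


def chord_notes(num_divs, max_divs, load_seconds):
    """Return chord notes only when the page took over one second to open."""
    if load_seconds <= 1.0:
        return []
    shape = INVERSIONS[scale_to_index(num_divs, max_divs, len(INVERSIONS))]
    return [BASE_NOTE + offset for offset in shape]
```

For example, a page with half the maximum number of links lands one octave above the base note, and a fast-loading page produces a melody note but no chord.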

Lastly, whenever a change of top-level domain (i.e. a change from .edu, .com, or other domain name) occurred, the maximum values for number of divs and number of links were reset. This helped avoid a situation where the mapping functions produced only low results because an extremely high number of divs or links on a single page had skewed the range. After resetting these max values, SuperCollider would play an extra 6th note, just for fun.
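The reset logic can be sketched as a small tracker that remembers the largest link and div counts seen so far and clears them when the top-level domain changes. In the project this bookkeeping may have been done on the SuperCollider side. The class name RunningMax and the starting value of 1 are assumptions for illustration.

```python
class RunningMax:
    """Tracks the largest link and div counts seen since the last TLD change."""

    def __init__(self):
        self.tld = None
        self.max_links = 1  # assumption: start at 1 to avoid dividing by zero
        self.max_divs = 1

    def update(self, tld, num_links, num_divs):
        """Record one page; return True if the top-level domain changed."""
        changed = self.tld is not None and tld != self.tld
        if changed:
            # Forget the old range so one huge page cannot skew later mappings.
            self.max_links = 1
            self.max_divs = 1
        self.tld = tld
        self.max_links = max(self.max_links, num_links)
        self.max_divs = max(self.max_divs, num_divs)
        return changed


tracker = RunningMax()
first = tracker.update("edu", 500, 900)    # very large page sets the range
second = tracker.update("edu", 20, 30)     # small page maps to a low note
third = tracker.update("com", 20, 30)      # domain change resets the range
```

After the third call the same small page sits at the top of the new range, so it would map to a high note rather than a low one.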

Curious as to how our music turned out? Listen to the music of the following web pages here (or explore the source code on GitHub):


Crawling the University of Michigan (umich.edu)

Crawling the New York Times’s website (nytimes.com)

Crawling Comedy Central (cc.com)

We also wrote a modified version of the web crawler to prioritize depth of links (i.e. following each link as soon as it is found) rather than breadth of links (i.e. exploring all links at one hierarchical level before going to a sub-level).
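To make the difference in visiting order concrete, here is a minimal depth-first sketch on a toy link graph. The function name depth_first_order, the site dictionary, the max_pages cap, and the skipping of already-seen pages are assumptions for illustration, not the project's code.

```python
def depth_first_order(start, get_links, max_pages=50):
    """Return pages in the order a depth-first crawl would open them."""
    seen = set()             # assumption: each page is opened only once
    stack = [start]
    order = []
    while stack and len(order) < max_pages:
        page = stack.pop()               # most recently discovered page first
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first link on the page is followed first.
        stack.extend(reversed(list(get_links(page))))
    return order


# Toy link graph standing in for real pages.
site = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}
dfs_order = depth_first_order("home", lambda page: site.get(page, []))
```

Here the crawl follows "a" down to "a1" and "a2" before returning to "b", whereas a breadth-first crawl of the same graph would open "a" and "b" before either of their sub-pages.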

To compare the sound of depth web crawling to breadth web crawling, listen to this alternate Depth Crawl version of the University of Michigan (umich.edu):
