suffen.us

data and humanities

On Speech Recognition

This past week I published a piece on Slate with Ethel Hazard on the under-examined assumptions we bring into speech-activated technologies, including the ways that they asymmetrically favor some English speakers over others.

Ironically, shortly after the piece was published, I received an email from a marketing representative insisting that I try a new technology: buy their innovative software, the pitch went, and any snags in verbal communication with my laptop or phone would disappear.

The only catch is that it costs 200-2000 dollars for a license, and I have to write my own application to package the recognition engine. I certainly get why they contacted me: in one view, I’m complaining about something that doesn’t work, but should, and the fix is something that I really need to know about.

But the problem isn’t finding the right speech processor; it’s the insistence that human-computer interaction is solely a technical problem. So often with artificial intelligence commodities, it’s easy to confuse emerging problems of technology and social justice with technical problems that can be solved through engineering.


Looking outside of Literature: Digital Humanities and Decision Support Systems


The word “model” can mean a lot of things. From critical theory, we generally inherit the word as a schema of interrelated concepts. From the digital humanities, we can see models as sets of parameters (proxy or directly measurable) that are interrelated and form the principle of definition for an idea, character, or process. A groundbreaking (though not exhaustive) progression can be traced through recent pieces by Graham Sack, Franco Moretti, and Ted Underwood (here, here, and here), all of which take generous and exploratory positions toward modeling from the perspective of the humanities. What’s ultimately at stake is what can be operationalized, measured, and bound together as part of an interdependent schema of variables that may have explanatory value in a literary text. For making critical, cultural, or historical arguments, these approaches to models in the humanities bring new evidence to interpretative practices of literature.

I am relentlessly positive about this direction in scholarship. This kind of modeling is different from, say, climate modeling or some forms of network modeling or agent-based modeling, where the goal is to produce simulations based on changes to specific parameters in the model or models at hand. That practice of modeling tends to emphasize prediction. For now, let’s just call these two groups explanatory and predictive models. We can certainly think of models in other groupings, and I don’t claim these two categories are the best, but my purpose is to provoke us to think about what the digital humanities could gain by participating in predictive models in addition to explanatory ones. Sack’s work actually employs some agent-based modeling, but the ultimate end is a simulation of narrative and characterization.

The predictive modeling I’m trying to temporarily cordon off for the sake of discussion makes predictions about non-literary systems with physical, time-based contingencies. Climate, traffic, flooding, fires, emergency response, religious conflict, etc. In addition to models that were aimed at or derived from literary texts, what if we took humanities knowledge-making practices (for both modeling and other kinds of interpretive practices) into the realm of predictive modeling? What if, in addition to using computational methods on literary texts, we used models as ways to take theories and methods derived from the evolving study of literature and have them inform computational models? Here we wouldn’t be interested in a machine reading of Hamlet; we’d be interested in taking the epistemology of humanities research methods and using them to inflect computational models that aim to solve problems in domains like medicine and civic infrastructure. The current work around modeling is crucial to make the latter possible, but what I am describing now is different in that the outcome of the scholarship is not an interpretation of a text(s), but a collaboratively produced solution space posited by a model.

What if a model aimed at simulating economic development around a rail line could consider critical race theory? What if that notion of development were expanded to consider other variables and evidence? What if the evaluation of the overall modeling and decision process included experts from the humanities?

Key to this would be a distinction between the methodologies of the humanities and social science. This kind of modeling is not new to social science, but social science has limitations in what it typically models and considers to be objects of study. Prospect Theory might be helpful for modeling some specific kinds of human decisions, but history and cultural theory may help situate that behavior in a broader imaginary that may change a prediction or change the significance of a set of outcomes. It’s the possibilities offered by this shift in emphasis that make decision support systems an intriguing area for digital humanities.

Decision support systems (DSS) are information and computing tools that aid in high-level decisions that must consider multiple and complex datasets and variables. How to treat a tumor, where to build a dam, how to promote economic growth in a given neighborhood, how to mitigate storm damage, how to make an organization more effective: these are all use cases for DSS. Enter the Complex Systems Framework (CSF). Developed by Robert Pahle’s group at Arizona State University, CSF is designed to ingest and arbitrate among multiple models and datasets, and to support real-time manipulation of the system by collaborating experts.

Workflow diagram of Complex Systems Framework. From complexsystems.com

CSF schematizes decision support as a web of data inputs, computational models, expert adjustments, and visualizations that culminate in a solution space of possible courses of action. Key to this DSS is the ability to include any module in a given decision support environment, making it possible to add new models without necessarily redesigning the entire system. While this is not unique, it is inviting. Developing humanities-informed models and datasets for existing problems or designing a DSS around humanities modules alone is not only possible, but also stageable in a multiscreen decision theater where experts interact with one another en route to recommending a solution.

CSF’s take on decision support is to facilitate a conversation, not spit out the most expedient solution.  The collaborative aspect of decision making in this DSS makes it a viable place for humanities and decision science to explore cooperation.

And this cooperation is, to me at least, one of the most exciting possible payoffs of the engagement of the humanities in modeling: a seat at the (sometimes figurative, this time literal) table when it comes to making decisions and recommending solutions for problems that intersect with the social and environmental grand challenges that face us all.

Bethany Nowviskie’s keynote for DH 2014 is slated to foster discussion around the role of DH in the Anthropocene. This is another area where I’m relentlessly positive, and the connections between DH and environmental humanities still have a lot of growing to do. Let this relatively recent engagement with modeling, while changing how we can read literary texts, also extend in the direction of environmental humanities. Within this broader movement, DSS is one area where there is promise to change how we orient the insights of the humanities and digital humanities.

NB: In the upcoming year I’ll be heading a team at ASU that will include humanities scholars alongside users and designers of CSF. As we begin to develop new modules and reframe problems we’ll be sending out updates to generate conversation specifically around DSS and DH.

Someone Said Something on the Internet!!! : Why Stories about Social Media Can Fail Us


To begin, let’s introduce two interconnected challenges attendant to the task of explaining what happens on social networks.

A narrative problem: Narratives are good at unfolding sequences of events, or arranging events into a sequence for communicative purposes. Networks are capable of a kind of simultaneity and interconnectedness that narratives can struggle to adapt to and explain.

A representational problem: When it comes to social media networks, figuring out what relevance social media activity has for other realms of culture and behavior can be tough. What a group of tweets really means depends on things that social media can often conceal: history, location-dependent context, the social environment beyond social media, etc. Whether research investigator or critical reader, we know that looking only at social media data can be terribly misleading. We pay attention to maps of geolocated tweets, for instance, but one study observed that 20 percent of Twitter users disclose their location, and depending on what groups one works with, that number can dip lower (close to 1 percent). And there is no way to precisely predict the distribution of users who do or do not share their location (we surely can try, but this remains a known uncertainty). Add to this mix the 140-character limit per individual tweet and we might be hard pressed to figure out what’s actually there in a batch of tweets.

But that’s no reason to proclaim that social media stories and analysis have no value.  It just means any social media stories and analysis need responsible embedding within other information.  Otherwise, we don’t know what we’re looking at other than “someone said something on the internet.”  That leaves us with a lot of ambiguity.

When I wrote about the racist, sexist tweeting reported by BuzzFeed, I wanted to call attention to the ways that copy/pasting or embedding tweets as a way of reporting or calling attention to events is an unhelpfully simplistic way of starting a public discussion about a given social media trend.  We know that there is limited data within social media alone, and reporting on Twitter by merely listing half a dozen tweets implies a representative quality that may not exist.  For instance, the article in question headlines the Twitter activity at U of I with “After Being Denied A Snow Day, University Of Illinois Students Respond With Racism And Sexism.”  The piece goes on to list 11 total tweets that ranged from cruelty to hate speech, followed by 9 tweets and a few comments by U of I alums that condemn the activity of the anti-Chancellor tweeters.  7 of the 11 spotlighted hate-tweeters have since deleted their accounts.  But this approach of “balancing” an approximately equal number of pro and anti tweets, with each group ranging in intensity, is very limited. First, let’s appreciate the good things this story did:

1) Called attention to hateful and reprehensible conduct so that it could be addressed, discussed, and punished. It also ended up being, for what it’s worth, a “teachable moment” at the University.

2) I’m open to other things to list here.

Now, let’s look at some of the bad things that happened:

1) In a platform defined by connections with other users, we get no sense of the diffusion of the ideas in question. All we know is that some people said some things, and others tried to correct them.  This is a story about activity on a social network, so please, show us the network!

2) As a result of 1), the headline is essentially “someone said something outrageous on the internet.”  People in turn respond with outrage.

3) Our collective response as citizens/consumers of this information is left without the proper tools and knowledge.  If 50 people said terrible things on Twitter, it makes a big difference if they were a) largely ignored; b) retweeted by a zealous few; c) massively propagated across a social network; d) condemned by a vocal few; e) widely condemned by an emergent coalition; f) barely condemned at all.  How we may go about contextualizing and addressing this activity and the agents involved depends on how it happened, not merely that it happened.

And so the business of “reporting” tweets alone is suspect.  If we want to draw conclusions from social media, whether our ambitions are journalistic or academic or both, we should have to consider social media activity within a network context. And this is at bare minimum, really.  There are many cultural, social, and historical dimensions that can and should be added.  If we do not have this context, we are much more likely to answer outrage with outrage.  Someone said something terrible, we are horrified, and we wait for the cycle to repeat.

By examining these things I wish to see our relationship to social media transform.  It can be an incredible resource, but it is also a very complex system, layered on top of so many other complex systems.  Insisting on representing social media “stories” linearly could be just fine in some scenarios, but it risks harmfully misrepresenting what is happening or what happened.  Without context (informational, cultural, etc.) we are left at the mercy of headline writers.  We should demand more out of any “story” that uses Twitter as evidence.

To return to this BuzzFeed piece, which I like for its early action but dislike for its presentation: “After Being Denied a Snow Day, University of Illinois Students Respond with Racism, Sexism” does not tell the same story as “After Racist, Sexist Responses to Snow Day Decision, University of Illinois Yet Again Conflicted on Racial Issues.”  The latter takes fuller consideration of social media activity (detailed previously) and gestures towards the institutional and cultural history of the place. The former uses half a dozen tweets to generate outrage.  If we want to address a brutal but complex institutional problem (like racism), nuance is our friend.  And, as it emerges, a lot of writing about social media just doesn’t supply enough of it.  In some situations this may matter less, but Justin Bieber saying something foolish on Twitter probably shouldn’t receive the same reportage as evidence of institutionalized white privilege at a major university.  “Here is a trend” is one way to tell a story. “What exactly is this trend?” will better serve us all.

When it comes to making social media more legible in order to help sustain a deliberative public environment, I believe we’re all on the same team.  I fervently believe that a site that lists a story entitled “17 Signs You’re in a Relationship with a Burrito” can also bring meaningful and impactful news.  I just think that’s what it means to be a hub for a lot of heterogeneous information. BuzzFeed staffers seem to believe this too, saying things like “The media’s new and unfamiliar job is to provide a framework for understanding the wild, unvetted, and incredibly intoxicating information that its audience will inevitably see — not to ignore it.”

Twitter Outrage, Charted: The Partial Anatomy of the #FuckPhyllis Trend, or Why I Don’t Trust BuzzFeed

On Sunday, January 26, Chancellor Phyllis Wise of the University of Illinois at Urbana-Champaign sent out notice that despite below-freezing temperatures, classes at the University would resume according to schedule on Monday morning.  The prospect of attending class in these conditions upset many students at Illinois, prompting an uproar on Twitter reported by BuzzFeed and the creation of the phony @ChanPhyllisWise twitter handle as well as the #FuckPhyllis hashtag.  As evidenced by a dozen or so embedded tweets, the #FuckPhyllis trend contained despicably racist and sexist insults directed at the Chancellor. BuzzFeed also reported that students turned to Change.org to produce a petition calling for classes to be cancelled, which garnered over 7,000 signatures overnight.  According to social media analytics service Topsy, the #FuckPhyllis hashtag only included some 2,000 tweets at the height of its activity.  And while there is no disputing how reprehensible the conduct of these students is, we should look more closely at this trend: who is tweeting, when, about what, and in agreement with or to echo whom? BuzzFeed calls as much attention to the hate speech as possible (seemingly implicating every petition signer in the social media hate-fest), but to whom did this hashtag, now effectively dead, ultimately belong?

When considering the scale of Twitter, 2,000 tweets is a very small number of communications.  But the reason #FuckPhyllis is so interesting (and likely the reason Buzzfeed paid attention) is how rapidly it trended:

[Chart: Topsy activity for #FuckPhyllis, showing the rapid spike]

Today the hashtag sees barely any activity at all.  But in the lifespan of this particular social media spike there may be some interesting patterns surrounding the network topology of different communities of tweeters. It’s important to note that many of the offensive tweets have likely been removed by their authors, though many do remain. Still, this is a hole in our data. Assuming many of the most vile things are now gone, all we can do is examine hostile vs. corrective communication networks rather than take a deep look at racist tweets of unknown quantity and verbiage.

What follows is a quick look into the #FuckPhyllis hashtag: the Twitter activity surrounding the topic starting around 10pm CST on January 26, 2014 and running up to 5pm CST the following day.  I used NodeXL to pull tweets that used the #FuckPhyllis hashtag, so the limitations of the Twitter API are present: not every tweet is included in this analysis.  Also, embedded tweets appear to refer to UTC rather than CST (making tweets from the 26th appear to be dated the 27th of January).  Still, there are some general trends in the hashtag that are illuminating.

First, let’s take a look at a graph of the tweets and retweets from the first two hours of the life of the hashtag:

[Graph: #FuckPhyllis tweets and retweets from the first two hours of the hashtag]

In this graph, lines or edges indicate a retweet, and loops indicate tweets that are likely responses to tweets by the same user.  Thus, the appearance of many loops lumped together shows people holding conversations by replying to their own tweets.  Each node is a tweet.  Node colors are determined by a modularity algorithm (Girvan-Newman) that groups nodes by shared connections.  What is shocking about this graph is how many tweets are actually reacting against the original negativity of the #FuckPhyllis trend.  The node highlighted in red, with by far the most retweets, is @suey_park’s retweet about white privilege from U of I student Briana Walker (who is synthesizing comments made by Park earlier on).

In fact, nearly every tweet that has a truly graphical property–retweeted by overlapping communities of agents who in turn are retweeted–rather than a simple hierarchical layout (see the red subgroup in the lower left of the above graph) contains or retweets a message of anti-racism.  This sentiment ranges from simple eye rolling to more sophisticated thoughts on white privilege at U of I.  Within a few hours of the first hateful tweet, the social media conversation was dominated by (centered on) reprimands and anti-racist commentary.  The graph demonstrates that it is the anti-#FuckPhyllis tweets that have a high Eigenvector centrality (used in network analysis as one way to study influence), or in other words a high number of retweets and mentions that also happen to be retweeted and mentioned.  The influence of anti-racist tweets appears to outstrip that of anti-Wise tweets.

One of the most popular remaining anti-Wise tweets (though one with no racist overtones) is represented by the aforementioned red subgroup, which shows several retweets of a single message.

Even though that tweet was retweeted a total of 64 times over its lifespan, the network reveals that it had little influence over the general social media conversation as it evolved.  Now in the upper left of the next graph, this particular tweet remains marginal across the following 17 hours’ worth of tweets:

[Graph: #FuckPhyllis tweets and retweets over the following 17 hours]

The dark blue component in the middle of this second graph continues to feature @suey_park (although this time for a different but similarly anti-racist tweet).  @suey_park’s tweet again has the most mentions and retweets by those who are also retweeted and mentioned.  As before, the graph is predominantly an anti-racist backlash, and the most retweeted of the anti-Wise messages is not overtly racist or sexist.

[Graph: tweets mentioning ChanPhyllisWise]

By comparison, the graph above shows tweets that include “ChanPhyllisWise”, the handle of the now-deleted bogus Twitter account used to mock the Illinois Chancellor.  Like the others, this graph has been clustered by modularity.  The majority of tweets shown on this graph demonstrate the kind of vitriol reported by BuzzFeed, and the two central dark blue and light blue tweets rest in a pair at the middle of the graph.

Both of these tweets demonstrate a relatively high total degree centrality (retweeted more than peer tweets) but a comparatively low Eigenvector rating (retweeted, but not by those who were themselves retweeted). Furthermore, we can see that the topology of this graph is markedly different from the first two.  Whereas the first two showed resonance around a small set of tweets, the third shows significantly more isolated conversations or tweets that were not retweeted at all.  The conversation that was significantly more sexist, racist, and hostile in tone is also one that features more fractured conversations, less information exchange, and lower connectivity among all nodes.
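To make the difference between those two measures concrete, here is a small sketch of degree vs. eigenvector centrality computed by power iteration on a made-up, undirected toy network. This is only an illustration of the metric, not NodeXL’s implementation, and the node names and edges are invented:

// Toy retweet/mention network, treated as undirected for simplicity.
// "a" and "z" are each tied to three other accounts, but a's neighbors are
// themselves retweeted while z's neighbors are isolated.
var edges = [
  ["a", "b"], ["a", "c"], ["a", "d"],
  ["b", "x1"], ["c", "x2"],
  ["z", "e"], ["z", "f"], ["z", "g"]
];

var neighbors = {};
edges.forEach(function(e) {
  (neighbors[e[0]] = neighbors[e[0]] || []).push(e[1]);
  (neighbors[e[1]] = neighbors[e[1]] || []).push(e[0]);
});
var nodes = Object.keys(neighbors);

// Degree centrality: a raw count of ties. Here "a" and "z" tie at 3.
var degree = {};
nodes.forEach(function(n) { degree[n] = neighbors[n].length; });

// Eigenvector centrality via power iteration: a node scores highly only if
// its neighbors also score highly.
var score = {};
nodes.forEach(function(n) { score[n] = 1; });
for (var i = 0; i < 100; i++) {
  var next = {};
  nodes.forEach(function(n) {
    next[n] = neighbors[n].reduce(function(sum, m) { return sum + score[m]; }, 0);
  });
  var norm = Math.sqrt(nodes.reduce(function(s, n) { return s + next[n] * next[n]; }, 0)) || 1;
  nodes.forEach(function(n) { score[n] = next[n] / norm; });
}

console.log(degree); // a: 3, z: 3 -- identical degree
console.log(score);  // "a" ends up with the highest score; "z" falls toward zero

The point is simply that two accounts can look identical by retweet count while occupying very different positions of influence in the network.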

Conclusions

I’m tempted to hypothesize that conversations that feature this kind of hostility in social media have a performative quality to them, with users appearing to want to one-up one another.  The network topology of the third graph, for what little information we have, suggests self-interest and not very much consensus beyond using the same hashtag and adopting an insulting tone.  The first two show preferential attachment to @suey_park and some consensus about who is “right.”  It would be interesting to listen in on similar (and sadly inevitable) Twitter trends as they emerge and again compare the topology of hostile and corrective social media networks to see how they stack up.  From their networks of communication, we can see a difference between principle and anger as motivations for social media use.   As Christopher Simeone put it when presented with this case, “principled actors who see themselves as part of a bigger cause or purpose behave differently than those whose only uniting principle is rage and self interest.”  Or perhaps we don’t have enough information to tell yet.

What’s interesting, however, is that the BuzzFeed article, while calling attention to the racist, sexist, and hateful things U of I students tweeted, greatly over-represents the salacious portions of this social media trend.  We’ll never know exactly how many tweets were deleted out of shame, but analysis of this trend shows a swift and concerted response that obliterated anti-Wise sentiment and replaced it with a new conversation about white privilege.  Yes, U of I students filled out a petition that got 7,000 signatures, but that does not equal 7,000 racists.  Clicking a button to try to get out of class is very different from taking to social media to spread hate.  A missing piece of the story is the response, and the unity of that response, to #FuckPhyllis.

Source files from NodeXL:

fuckphyllis hashtagfuckphyllis ChanPhyllisWise

Update 1/28/2014 9:50 MST: I want to encourage everyone to see Kevin Hamilton’s comments below, as they raise some important concerns, and I feel the need to clarify.  I do not believe that the University of Illinois is a place free of racism and white privilege, or a place where anti-racism somehow excuses acts of racism.  I do, however, see the data discussed above as showing a contrast in approaches to communication that may map onto angry vs. principled tweeting, and, crucially, how divided the University can be when it comes to issues of race. While there is persistent and unjust quarter given to racial intolerance on campus (highlighted below by Kevin’s comments), this should not obscure that there are principled actors at U of I, and that the story of the place is one of deep and long-standing division on racial issues, not thorough moral decay.

Oh, and 1/28/2014 11:20 MST: I’m a U of I alum (2011)

Update 1/29/2014 4:32 MST: The joint statement published today by U of I President and Board Chairman

Update 2/3/2014 11:12pm MST: Part 2, pertaining to social media stories and where they go wrong

Making a Simple Interactive Map Prototype with D3…For Total Beginners Who are Totally Impatient

[Screenshot: the finished interactive choropleth map]

The goal of this piece is to get you up and running with a simple interactive map accessible by web browser.  Rather than make sure you know everything ranging from computing basics to fundamentals of programming to web development to geospatial graphics to data visualization, we’re going to get to a prototype as quickly as possible.  For those who are daunted by all of that, going straight for a prototype produces tangible results, which is good for morale, and many people learn better this way.

We’ll be using D3.js, a javascript library (a collection of tools that can be called upon within a script) that helps map information from input files and join that information to Scalable Vector Graphics (SVG) objects.  Your web browser can already draw SVG given the proper instructions, and D3 provides the tools to bind data to those drawings.  SVG is different from bitmap, .jpg, or other static image formats in that it describes an image as a set of drawing instructions rather than a pixel-by-pixel specification of what something should look like.
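If “binding data to SVG” sounds abstract, here is a tiny, throwaway example, separate from the map we’re about to build (it assumes d3.js has already been loaded on the page): three numbers become three circles, and each circle’s size comes from its datum.

// A minimal illustration of D3 v3 data binding: one circle per number.
var numbers = [10, 25, 40];

var demo = d3.select("body").append("svg")
.attr("width", 300)
.attr("height", 100);

demo.selectAll("circle")
.data(numbers)                // join the array to an (initially empty) selection
.enter().append("circle")     // create one circle per datum
.attr("cx", function(d, i) { return 50 + i * 100; })
.attr("cy", 50)
.attr("r", function(d) { return d; }); // the bound value drives the radius

Everything in the map below follows this same pattern, just with county shapes instead of circles and Census values instead of a hard-coded array.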

Some caveats: never draw too many conclusions from a prototype, and since you are not an expert in any of the things listed above, you might need some new friends and collaborators; I hope this tutorial becomes an occasion to find them.  It is not a replacement for collaborating with the good and wonderful people out there who do this kind of work.

The map we want to make will plot data from a comma-separated values file (we can make this in Excel) onto what’s known as an svg image (we’ll talk about how to get one of these soon).  This map will be accessible to anyone with an updated web browser and an internet connection.  The result we will aim for is something like what’s positioned above: a map of a place divided into subsections that are color-coded by value.  We’ll add mouseovers too so that we can display labels and other information as people explore.

I want to make a distinction among the kinds of instructional articles for this kind of material, and say where this one fits in.  The most common kind (and the most helpful to those who already have some web and computing background) puts all of its helpful materials into Github and invites you to clone the repository or even compile the code on your own.

There are many, many students of web development who are in the debt of posts like this. But they may not work for you.  You do not know what Github is, or even if you do, you are frustrated that you cannot download anything from this website.  You don’t want to learn everything about javascript, and eventually you plan on getting a really thorough foundation on the models and principles underlying visualization libraries and tools, but that day is not today.  What you want to do now is make a damn map so you can go back and do what you were doing before you had this idea to try out mapping.

So here’s the obligatory mention of wonderful introductions to Github, D3.js, and javascript.  You really should read them sometime.  And more. But not now.

[update December 1, 2015] For more information about data visualization, see a related project, Stories from Data. 

What you will need

Firstly, you’ll need a few pieces of software/web services that you will hold on to after this project.

-The first is a text editor.  Sublime Text is brilliant. Notepad++ and Textwrangler are also very good.

-If you don’t have a Dropbox account, sign up for one here.

-Microsoft Excel or OpenOffice

If we were in an ideal world, we’d have our own webservers and sftp our content to our host machines, and we wouldn’t use a spreadsheet program to manipulate data (We’d use R).  But that’s a lot of work to do, R is notoriously tricky for beginners, and your website is hosted by WordPress or Blogger. You really want this map to materialize. So for now, these are the tools we will use.

At this point you either have some data you want to spatially visualize, or you don’t.  It’s important to note that for this purpose, that data will have to be text or numbers.  If you want to make a complex overlay of images, text, video, and draw animated lines connecting places to show complex relationships over space, you certainly could do that using the tools we have here, but it is beyond the scope of this piece.  DH press might be a good start if that’s what you need.

Here, we are going to make what’s known as a Choropleth, the kind of map where we divide space up into regions and color-code the regions by value.  Again, there are so many kinds of maps, but our goal is to get you up and running with a prototype that you can learn from because it works, not attempt to transform you into a programmer or web developer.

The data to use

So back to the issue of data.  You’ll need 3 main pieces:

-a special geospatial file called a Topojson.  If you want to make your own (eventually, you might have to), this tutorial and this tutorial are very helpful. Otherwise, more will be posted at the bottom for download (besides the us.json we use here) as time goes by. We’re going to use a file made up of all US counties.  If you go to this link, copy the contents into a Sublime Text file, and save it as “us.json” (or anything .json), then you’ll have this step wrapped up.

-Since we have a county map of the USA, we’ll need some data that is broken down by county.

[Screenshot: a spreadsheet of county-level Census data]

A good start for more data like this is here.  To decode what all of the labels in the column headers are, try this file. 

Just remember that the us.json file we’re using only has counties drawn. When you’ve selected a column that has data you find interesting, delete the entries for states and copy the information into a file that looks like this one. The states are listed in all caps and have no values associated with them, so they’re easy to pick out and delete en masse.  Keep in mind you could keep a whole lot of county data (multiple columns) and read selectively from them when you get to making your visualization.  For this tutorial I’ve pared down the .csv to include the number of family and independently owned farms, by county, from the 1992 US Census.

Important: Because we’ve worked with US Census data, NIST has already ensured that two very important things are true:

1) Our topojson has regions designated with Federal Information Processing Standards (FIPS) numbers.  Which,

2) Coincide with the Census data pulled from each county.

Whatever .csv data you have, make sure that there is a column that associates the information to a naming or id standard that is also present in your map/topojson. Working with federal atlases and US Census ensures that this is the case, but wandering outside of the government’s standards will require that you make sure these numbers match up.

So, you’ll want to place both of these files into a folder that you’ve made.  The next thing you’ll do is open a blank text file in Sublime or whichever program you’re using, and paste in the full code included at the bottom of this post.

WARNING: before you paste this code, be advised that your columns must be titled identically to the example files above, and that the .csv file containing those columns must be named “data.csv” (the name the code looks for).  Your .json file should be named “us.json” as well.  You can rename things (column headers, file names, variables within the code itself) however you like once you become more comfortable.
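To make those naming requirements concrete, the top of data.csv should look roughly like this. The column headers (id, name, rate) are the ones the code below expects; the ids are county FIPS codes, while the county names and farm counts shown here are made-up placeholders:

id,name,rate
17001,Adams County,980
17019,Champaign County,1450
17031,Cook County,320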

A tour of the code

First we’ll begin with the header, which announces that this is an HTML document rather than plain text and supplies some information about the language (English) and character set (utf-8) it contains.


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Independent Farms by County - Choropleth</title>
<script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
<script type="text/javascript" src="http://d3js.org/queue.v1.min.js"></script>
<script type="text/javascript" src="http://d3js.org/topojson.v1.min.js"></script>
</head>

Note that in the final section of the header we’re pointing the web browser to some javascript files that will do important work for this web page.  The first will let us map data into svg graphics (d3.js), the second lets us work with multiple input files for a javascript script (queue), and the third helps us interpret the instructions in us.json as an image in the browser (topojson).

The next section defines CSS for objects on the page.  You can read more about CSS here, but basically what we’re doing here is defining style rules for objects that have given labels:


<style>

path {
stroke:white;
stroke-width: 1px;
}

body {
font-family: Arial, sans-serif;
}

.legend {
font-size: 12px;
}

div.tooltip {
position: absolute;
text-align: center;
width: 150px;
height: 25px;
padding: 2px;
font-size: 10px;
background: #FFFFE0;
border: 1px;
border-radius: 8px;
pointer-events: none;
}
</style>

“Path” refers to lines drawn as instructed by our topojson file (us.json).

“Body” helps define text included in the area of the html page defined as “body”

Notice that .legend and .tooltip refer to objects we’ll designate with our javascript, but we can still set what they’ll look like here in the CSS.

The next section begins the body of our page, in which we’ll embed javascript.  You’ll see the title of the page, followed by the designation that what follows is a script written in javascript to be executed when the page loads.

You’ll see a lot of “var =”, which sets up our variables for the code.  Note that the first few variables determine which values map to which colors; changing these variables (as well as the CSS) is an easy way to change the appearance of this map.  Colors are coded by RGB HEX value (make your own gradients here).  There are multiple ways to scale colors, but this is the one we’ll go with here.

<body>
<h1>Independent Farms in the USA</h1>
<script type="text/javascript">
var width = 960,
height = 500;
var color_domain = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000]
var ext_color_domain = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000]
var legend_labels = ["< 500", "500+", "1000+", "1500+", "2000+", "2500+", "3000+", "3500+", "4000+", "4500+", "5000+", "5500+", "6000+"]
var color = d3.scale.threshold()
.domain(color_domain)
.range(["#dcdcdc", "#d0d6cd", "#bdc9be", "#aabdaf", "#97b0a0", "#84a491", "#719782", "#5e8b73", "#4b7e64", "#387255", "#256546", "#125937", "#004d28"]);

var div = d3.select("body").append("div")
.attr("class", "tooltip")
.style("opacity", 0);

var svg = d3.select("body").append("svg")
.attr("width", width)
.attr("height", height)
.style("margin", "10px auto");
var path = d3.geo.path()

The “svg” variable is crucial here: it designates the joining of a to-be-specified svg element to the body of the html page. D3 will let us map data from our files onto this svg designation.  Also note that the “path” variable calls on a capability of D3 to draw lines based on geospatial information fed to it by our topojson.  If we were to change the path rules in our CSS, it would change how these lines are drawn.

The next section prepares our files to be read by D3 and plotted onto our SVG “canvas.”


queue()
.defer(d3.json, "us.json")
.defer(d3.csv, "data.csv")
.await(ready);

And the section after that will do some very important work: set up two blank containers and fill them with pairs.  Each pair is keyed by “id” (one of the column headers from our .csv).  The result is a lookup of values by id number that we can call on later.  A .csv file alone does not accomplish this; this step maps the .csv file so that it is legible as associated values.


function ready(error, us, data) {
var pairRateWithId = {};
var pairNameWithId = {};

data.forEach(function(d) {
pairRateWithId[d.id] = +d.rate;
pairNameWithId[d.id] = d.name;
});

d.rate and d.name refer to the column headers of our .csv.  There’s a “d” before them because that’s the conventional name for the current piece of data in a D3/javascript callback.  See how we read in “us” and “data”? Inside data.forEach, “d” stands for one row at a time of that data parameter, which in this case is our .csv (a file that, while it doesn’t have to be, is also named “data”).
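Concretely, using the made-up sample row from earlier, here is roughly what one pass through that loop sees (d3.csv hands the callback an object of strings keyed by the header row):

// For a data.csv row like: 17019,Champaign County,1450
var d = { id: "17019", name: "Champaign County", rate: "1450" };
pairRateWithId[d.id] = +d.rate;  // the unary + turns the string "1450" into the number 1450
pairNameWithId[d.id] = d.name;   // "Champaign County"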

And now we’ll select the svg objects we’ve created but not specified, and map our data onto them:


svg.append("g")
.attr("class", "region")
.selectAll("path")
.data(topojson.feature(us, us.objects.counties).features)
.enter().append("path")
.attr("d", path)
.style ( "fill" , function (d) {
return color (pairRateWithId[d.id]);
})
.style("opacity", 0.8)

This will draw each county as an object, each with its own values.  Notice that we’ve named this class of object “county.” If we wanted to change the style of the counties in CSS up at the top, we could just refer to .county and make changes.  Also, the “.data” line associates information from our us.json file with the county objects (the stuff in parentheses refers to the way the topojson hierarchizes information and points the script to the right container in the hierarchy).

Also important is that “color” refers to the function set above in the code (up in the section with all the “var= ” business).  “Color” expects a number as input, but instead of a specific number, we’re going to give it our container filled with pairs of ID numbers and rate values (in this case, it’s family and individual farm counts for each county), and use [d.id] to make sure that we read in a value for each id number.
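If you want to see that mapping in isolation, you can open your browser’s console on the finished page and call the color function directly; with the thresholds defined above, it behaves like this:

color(120);   // "#dcdcdc" -- below the first threshold of 500
color(700);   // "#d0d6cd" -- falls in the 500-999 bucket
color(9000);  // "#004d28" -- at or above the last threshold of 6000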

The rest is what happens when the mouse glances over the county:


.on("mouseover", function(d) {
d3.select(this).transition().duration(300).style("opacity", 1);
div.transition().duration(300)
.style("opacity", 1)
div.text(pairNameWithId[d.id] + " : " + pairRateWithId[d.id])
.style("left", (d3.event.pageX) + "px")
.style("top", (d3.event.pageY -30) + "px");
})
.on("mouseout", function() {
d3.select(this)
.transition().duration(300)
.style("opacity", 0.8);
div.transition().duration(300)
.style("opacity", 0);
})

Notice how we’re calling the county names and farm counts with a similar technique as before.  The “div.text” will behave according to our “div.tooltip” CSS style that was established at the top.  The duration of the transition (which in this case transitions from less to more opacity, creating a highlight effect) is listed in milliseconds.

And now, to draw the key that explains what each color means.  If you want to change what each label is, make sure to adjust the variable “legend_labels.”


var legend = svg.selectAll("g.legend")
.data(ext_color_domain)
.enter().append("g")
.attr("class", "legend");

var ls_w = 20, ls_h = 20;

legend.append("rect")
.attr("x", 20)
.attr("y", function(d, i){ return height - (i*ls_h) - 2*ls_h;})
.attr("width", ls_w)
.attr("height", ls_h)
.style("fill", function(d, i) { return color(d); })
.style("opacity", 0.8);

legend.append("text")
.attr("x", 50)
.attr("y", function(d, i){ return height - (i*ls_h) - ls_h - 4;})
.text(function(d, i){ return legend_labels[i]; });

</script>
</body>
</html>

With this we designate an unspecified group of svg objects as “legend”, associate this group with data from our variables, then attach rectangles and text that are bound to that data. This selection of objects and binding of data to them is what makes D3 so exciting, among other things.

Wrapping up and posting to the web

When you have your html file saved, give it a name and place it in your folder that contains us.json and your data.csv.  Follow these instructions and place the contents of your folder into the “public” folder of your Dropbox to begin hosting the webpage containing your map.

And there you have it.  Swap out files, tweak variables, edit the style: get this to work and then work on changing and breaking it.  Include hyperlinks or interesting text in your mouseovers. Represent more than one value. And so on. After all, sometimes it’s more fun to read about new things in the context of “what can I get my project to do now” rather than “time to learn everything.”
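For example, one small tweak (sketched here with a made-up URL pattern) is to swap div.text for div.html inside the mouseover handler so the tooltip carries a link; you would also want to remove the pointer-events: none rule from the tooltip CSS so that the link is actually clickable:

// Inside .on("mouseover", ...), replace the div.text(...) call with something like:
div.html('<a href="http://example.com/county/' + d.id + '">' + pairNameWithId[d.id] + '</a> : ' + pairRateWithId[d.id])
.style("left", (d3.event.pageX) + "px")
.style("top", (d3.event.pageY - 30) + "px");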

The full code:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Independent Farms by County - Choropleth</title>
<script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
<script type="text/javascript" src="http://d3js.org/queue.v1.min.js"></script>
<script type="text/javascript" src="http://d3js.org/topojson.v1.min.js"></script>

</head>
<style>

path {
stroke:white;
stroke-width: 1px;
}

body {
font-family: Arial, sans-serif;
}

.city {
font: 10px sans-serif;
font-weight: bold;
}

.legend {
font-size: 12px;
}

div.tooltip {
position: absolute;
text-align: center;
width: 150px;
height: 25px;
padding: 2px;
font-size: 10px;
background: #FFFFE0;
border: 1px;
border-radius: 8px;
pointer-events: none;
}
</style>
<body>
<h1>Independent Farms in the USA</h1>
<script type="text/javascript">
var width = 960,
height = 500;
var color_domain = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000]
var ext_color_domain = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000]
var legend_labels = ["< 500", "500+", "1000+", "1500+", "2000+", "2500+", "3000+", "3500+", "4000+", "4500+", "5000+", "5500+", "6000+"]
var color = d3.scale.threshold()
.domain(color_domain)
.range(["#dcdcdc", "#d0d6cd", "#bdc9be", "#aabdaf", "#97b0a0", "#84a491", "#719782", "#5e8b73", "#4b7e64", "#387255", "#256546", "#125937", "#004d28"]);

var div = d3.select("body").append("div")
.attr("class", "tooltip")
.style("opacity", 0);

var svg = d3.select("body").append("svg")
.attr("width", width)
.attr("height", height)
.style("margin", "10px auto");
var path = d3.geo.path()

queue()
.defer(d3.json, "us.json")
.defer(d3.csv, "data.csv")
.await(ready);

function ready(error, us, data) {
var pairRateWithId = {};
var pairNameWithId = {};

data.forEach(function(d) {
pairRateWithId[d.id] = +d.rate;
pairNameWithId[d.id] = d.name;
});
svg.append("g")
.attr("class", "county")
.selectAll("path")
.data(topojson.feature(us, us.objects.counties).features)
.enter().append("path")
.attr("d", path)
.style ( "fill" , function (d) {
return color (pairRateWithId[d.id]);
})
.style("opacity", 0.8)
.on("mouseover", function(d) {
d3.select(this).transition().duration(300).style("opacity", 1);
div.transition().duration(300)
.style("opacity", 1)
div.text(pairNameWithId[d.id] + " : " + pairRateWithId[d.id])
.style("left", (d3.event.pageX) + "px")
.style("top", (d3.event.pageY -30) + "px");
})
.on("mouseout", function() {
d3.select(this)
.transition().duration(300)
.style("opacity", 0.8);
div.transition().duration(300)
.style("opacity", 0);
})

};

var legend = svg.selectAll("g.legend")
.data(ext_color_domain)
.enter().append("g")
.attr("class", "legend");

var ls_w = 20, ls_h = 20;

legend.append("rect")
.attr("x", 20)
.attr("y", function(d, i){ return height - (i*ls_h) - 2*ls_h;})
.attr("width", ls_w)
.attr("height", ls_h)
.style("fill", function(d, i) { return color(d); })
.style("opacity", 0.8);

legend.append("text")
.attr("x", 50)
.attr("y", function(d, i){ return height - (i*ls_h) - ls_h - 4;})
.text(function(d, i){ return legend_labels[i]; });

</script>
</body>
</html>

Welcome

This site is a notepad for sharing information and floating experiments in thinking through qualitative and quantitative methods at the same time to address multifaceted problems.