thirteen ways of looking at a catalog (in verse)

I’ve said in the past that if people are going to use Wallace Stevens’ “Thirteen Ways of Looking at a Blackbird” as a literary conceit to introduce something they’ve written, it would be nice if they would at least put in a little verse. I was reminded of this today because I finally sat down and read Lorcan Dempsey’s “Thirteen Ways of Looking at Libraries, Discovery, and the Catalog: Scale, Workflow, Attention”, which appeared last December in Educause Review. It’s an excellent article and if you’re at all serious about libraries, library catalogs, and the directions they may go in the future, you should read it.

But I should warn you: you won’t find any “thirteen ways”-style verse there. So, because I apparently didn’t have anything better to do this evening, I came up with this*:

I

Among twenty open networks,
The only local thing
Was the eye of the catalog.

II

I was of three minds,
Like a catalog
For which there are three interfaces.

III

The catalog merged with the larger web.
It was a small part of the data mine.

IV

A search and a suggestion
Are one.
A search and a suggestion and a catalog
Are one.

V

I do not know which to prefer,
The unity of experience
Or the diversity of resources,
The catalog interface
Or just discovery.

VI

Metadata filled the long record
From abundant streams.
The data for the catalog
Crossed it, to and fro.
The model
Traced in the data
An interoperable dream.

VII

O traditional library,
Why do you imagine one-stop sites?
Do you not see how the catalog
Can work through the flow
Of the people about you?

VIII

I know external sources
And local, curated collections;
But I know, too,
That the catalog is involved
In what I know.

IX

When the catalog slipped out of sight,
It marked the edge
Of one of many layers.

X

At the sight of catalogs
Aggregated in a network,
Even the champions of serendipity
Would cry out sharply.

XI

He drove around the region
Storing books off-site.
Once, a fear pierced him,
In that he mistook
The shadow of a printed page
For a catalog.

XII

Network sources are growing.
The catalog must start linking.

XIII

They were sourcing all around the room.
They were scaling
And they were going to scale.
The catalog sat
In the server rooms.

—–

*I tried to make the points match the verses in order, even at the cost of some prosody. (It’s not like I had some artistic vision to compromise, anyway.) If I’d done this from scratch, I’d have made different choices.

a bunch of stuff I said about archives

It’s a sign of how much I’ve been neglecting this blog that even though I did an email interview last month about archives, and even though I posted links to it from my social media accounts, I’m only now mentioning it here. I don’t know if anyone who reads this blog doesn’t also follow me on Twitter, but if that’s you, then presumably you’d be interested in reading the interview.

My interviewer was Roisin O’Brien, who is a Master’s student in Digital Arts and Humanities at University College Cork (UCC) in Ireland. The topic was “What is the role of an archive in the digital age?” Much of the discussion is about digitization, although I think a lot of what I said applies to born-digital material as well. As an added bonus, I decided to go against prevailing trends and decline to offer up a definition of “digital humanities.” I figure it gets defined so often, I don’t need to add my own. (My interview answer is less glib.)

Anyway, I’m probably still new enough to archives that you should apply the appropriate discount rate to what I say, but I was glad to have the opportunity to do some real writing again. I’ve been struggling to get back into the routine of blogging regularly.

a sign of the times in academic publishing

Catching up on end-of-the-year email, I came across the following notice in the UC Berkeley Department of History’s Fall 2012 newsletter (pdf):

Geoffrey Koziol’s new book was published by Brepols: The Politics of Memory and Identity in Carolingian Royal Diplomas (2012). Thanks to subventions from the History Department and UC Berkeley’s Committee on Research, the price is a moderate $100, which may seem like a lot, but European academic presses are increasingly pricing books at $200, beyond the ability of even mid-sized college libraries to afford them. It is becoming very difficult to publish innovative scholarship of any length and complexity. Flexible sources of funding are sorely needed.

Incidentally, although I eventually found myself specializing in American history, Koziol’s undergraduate survey course on medieval Europe played a big part in my decision to major in history. You can find out more about his book here.

how I’ve been approaching MOOCs

For a while there, I almost forgot I had a blog. At the start of the fall, I decided to sign up for a bunch of online courses: mostly Coursera-hosted, but also a couple on Udacity, and then in October I signed up for a class on edX and another class on Stanford’s open source platform, Class2Go. Did I really intend to “take” all of those courses at the same time?

Of course not, although I am committed to finishing the Udacity courses I’m in, as those relate most closely to some personal projects I’m working on (more on that another time). But even if I had wanted to finish all those courses, I signed up for so many that it would have been impossible. Mainly, I just wanted to explore.

Free online courses – I mean MOOCs, but it still sort of pains me to write “MOOC” – look for the most part like regular in-curriculum courses, so it’s no surprise that the language we already have for talking about taking classes (“enroll”, “take”, “drop”, etc.) gets applied to MOOCs as well. But I don’t think simple add/drop language really captures how I’ve come to approach online courses over the past six months.

Jokes about being a “serial MOOC dropout” aside, I’ve come to believe that the fact that you aren’t obligated to finish these courses is actually one of the features that makes them valuable to a post-school learner like me. Obviously, you’re usually going to get more out of a course if you complete it than if you don’t. But traditionally, if you enroll in a course and then find you can’t finish it, you either drop it or you take a stiff penalty for doing poorly. Either way, once the course is done, that’s it. When I was in college, I used to attend a bunch of different courses during the first week or two of a term, collect syllabi, and then pare down my schedule to what I could manage. Pretty much all I got out of the courses I didn’t take were syllabi and reading lists.

Many MOOCs, however, allow you to download all or most of the course materials for your own personal use. It’s still not as good as taking an actual online or in-person course, but it’s more than getting just a syllabus and often more than what gets posted on university OpenCourseWare sites (which I nevertheless find quite useful). From this perspective, I think Tim Burke is right when he writes that “maybe MOOCs are an exciting new form of publication, not teaching” (although I’d push back a little and argue that publications can teach).

At the same time, if you want to try things out – at least while the class is still “live” – you usually can even if you don’t plan to finish the course. This may not work with courses that have peer assignments – I haven’t taken any of those – but the courses that make use of auto-grading open up the platform to everyone who signs up. You won’t get personalized attention, but sadly that’s the case with most MOOCs whether or not you finish them.

When I first signed up for an online course, I hadn’t really thought much about these affordances. I was still operating under the assumption that if you can’t take a course, you just drop it. As it happened, the first course I signed up for was a natural language processing course offered by Stanford last spring. There was a delay between the initial announcement and the actual course launch and by the time the class started I was so busy finishing up grad school that I had no time for it. So I “dropped” it and never even logged in to look at the materials.

I’d never do that now; instead, I’d have at least signed in to download the materials for future reference.

Looking back, I see that over the course of the summer and fall, without really intending to in any systematic way, I’ve developed the following approaches to MOOCs:

(Disclaimer: These are general, overlapping categories, and sometimes I move a course from one category to another. But I find it more helpful to frame things this way than to earnestly believe I’m going to finish every class I sign up for. Letting assignments go without thinking about them took a surprising amount of getting used to.)

Collection-building: this is pretty much an extreme form of treating a MOOC like a traditional publication. Sometimes, when I learn about an interesting book that will be available temporarily for free, I download it even though I know it could be a long time before I read it. Similarly, sometimes courses are offered on topics I want to learn more about, but which are not currently priorities for me. In those cases, I’ll sign up simply to gain access to the content.

If the course looks good, and the platform allows downloading, I’ll save my own copy of the materials as they become available. This can take up a bunch of storage space, but there’s no guarantee that the course will be offered again or that the learning platform will continue to allow access after the course is over (or that the learning platform will even continue to exist). Saving the material locally means that I’ll at least be able to view the lectures and possibly do the readings – if they’re openly available, which they sometimes aren’t – when I’m ready for it. It’s sometimes even possible to save the exercises, though there won’t be anyone to grade them.

(I should note that this isn’t an exclusive category: I also save the materials for courses where I do more than just add to my MOOC collection.)

Exploration: this is probably the approach I use most frequently. Much like collection-building, I do this for courses where the topic isn’t really a priority, but I want to know more about it. But unlike with collection-building, I’ll do the first quizzes or assignments. Usually, this is because the course involves some tool or technique or method I just want to try out and the course gives me a controlled environment where I can do that. If a course just gets me to install and mess around with some tool (or language or whatever) I’ve been curious about, that by itself can make it worthwhile to have signed up.

Other times I sign up for courses that ask for prerequisites I don’t really have – many technical courses are like this – and I want to see just how much I have to learn if I’m going to try to complete a similar course “for real” in the future. You’d be surprised at how much you can learn while not finishing a course.

Auditing: this is more or less equivalent to auditing a regular course. Here my plan is simply to listen to the lectures and do some of the readings, but not to do the assignments. Or at least, if I write something about the course material, I’ll do that on my own, outside of class time and not necessarily in response to the class prompts. I tend to reserve auditing for fields where I already have a lot of experience doing the work – like history.

Taking: yes, I actually do take some courses in the traditional sense. This can be the most difficult category for me to determine ahead of time. Most of the courses I’ve seen at Coursera and edX have time limits and if your schedule doesn’t fit into that time frame it can be quite difficult to keep up. Often, a busy week or two in non-MOOC life can be enough to throw everything off. As a result, my default approach is to explore and then I re-evaluate as the course progresses. It’s no coincidence that the courses I’m currently still “taking” are self-paced Udacity courses, which I know I’ll be able to finish eventually.

I’ve mostly limited my “real” course-taking to what I consider foundational courses, such as learning how to program, or courses that are directly related to personal projects I’d like to carry out, such as web development. After all, my goal in taking courses isn’t simply to be able to say I’ve taken courses: it’s to learn things that I can and will put into use.

MOOC-learning isn’t the only learning I’ve been doing lately. In fact, I’ve found myself turning back to regular books, including tech books. But as this post has already grown far longer than I intended, I’ll save the MOOC-to-book discussion for another day.

a MOOC disclaimer

From my “Statement of Accomplishment” – not a certificate – for the Coursera course in Human-Computer Interaction, taught by Scott Klemmer of Stanford:

Please note: This online offering of Human-Computer Interaction does not reflect the entire curriculum offered to students enrolled at Stanford University. This document does not affirm that you were enrolled as a Stanford student in any way; it does not confer a Stanford grade; it does not confer Stanford credit; it does not confer a Stanford degree or a certificate; and it does not verify the identity of the individual who took the course.

The statement seems reasonable given that I took the quiz track, which meant that the only work I “showed” was the answers to 19 multiple choice quiz questions. I’m actually a bit surprised I did well enough to reach the “accomplishment” threshold: there was a substantial penalty for quizzes submitted after the due dates and I started a few weeks late. My original intention was just to audit, and I did watch all of the lectures. I took the quizzes just because they were there. I can’t say I’m a fan of multiple-choice – especially the “choose the ‘best’ answer” style of multiple choice – but what I got out of the lectures outweighed my dislike of the quiz format.

All in all, I thought it was a good course, even if the way I approached it made it more like a videobook than an actual class. There was another track where you built and tested your own designs and reviewed others’ projects, but I was too late to participate in that. If anyone reading this is interested in taking the course when it’s offered again, I recommend trying that track out if you have the time.

what would it take for historians to be able to share archival material?

Recently, a friend of mine asked if I had any thoughts on why historians tend not to do much sharing of archival materials – that is, of materials that they’ve collected in the course of their research. I said I didn’t really know why, but I could speculate, and since speculation is one of the reasons blogs exist, I thought it would be worth writing up a post about it. The conversation also got me thinking in a more positive direction: let’s say historians do start sharing more archival material, what forms could that sharing take? What kind of infrastructure would they need? Is it something we could start building now?

But first, what do we mean by sharing archival material? Let’s say you’re a historian and you’re on a research trip. You request material and some of it turns out to be relevant to your research, some not so much. (And some of it is just too interesting to pass up.) You take notes, maybe even make some full transcriptions, but there are almost always going to be some materials that you decide you want to copy. Maybe you want to be able to see just how the document was laid out, maybe you want exact wording but don’t have time to transcribe it, or maybe you simply don’t have enough time to read the documents during your visit, but you can take lots of photographs quickly. Whatever the reason, odds are you’re going to come home and find yourself with lots of copies of archival material from the trip. This is the kind of material we were talking about sharing.

A second preliminary point: historians do share. Maybe not everyone, maybe not all the time, and almost certainly not everything, but I don’t want to give the impression that historians solely collect and hoard documents and then guard their hoards. However, I think much of the sharing that goes on stops short of sharing actual (copies of) material. You’ll see historians talk to each other about what they’ve found; give each other advice about what to expect when working at a particular place or on a particular collection; or even publish articles in historical journals discussing where to find sources for various topics or, conversely, what kind of topics could be researched using particular collections. All of this certainly counts as sharing, but it may not extend to the sharing of archival material to go along with information about archival material. That said, there is still a tradition of formally publishing selected primary sources, whether in journals or as edited book collections. This may consist of archival material (in the sense that archivists give to the word “archives”) and previously published material.

I am deep in the realms of speculation here, but I suspect that when historians do share archival material – outside of formal publication – it tends to be stuff they are not actively using. This could be stuff they’re done with, or it could be “incidental finds”: stuff they’ve collected that turns out not to fit in with their research, but which they know may be relevant to another researcher (“I was looking through the papers of so-and-so and came across these letters, thought you’d be interested so I’m passing them along”). Sharing those kinds of finds is, not so incidentally, one of the reasons I went into the archives/library fields: I love playing matchmaker between sources and researchers.

These kinds of sharing – whether of information, materials, or published research – show the scholarly community at its best, so why don’t more historians do more sharing of archival materials (assuming that it is accurate to say that many don’t)?

Here are my guesses:

1. It hasn’t become standard practice, so it’s not something that occurs to everyone while they’re doing research. That may be a tautological explanation, but I really think this is something that could be self-reinforcing: if more historians were already sharing material, then you’d probably see more sharing. There’d be more models for it.

2. Worries about being “scooped.” Releasing their raw materials, so to speak, might make it possible for someone else to use the material they collected and then publish first. Depending on context, this might be a real concern, but in other cases the two historians might end up taking very different interpretative approaches: priority in publishing isn’t quite as important in history as in some other fields. Also, this shouldn’t really be a big concern once the historian who collected the material has published.

3. This is closely related to point 2: historians still generally get the most credit for traditional publications. This seems to be changing, but the incentives have long been weighted towards publishing and disseminating finished research, rather than the materials on which research could be based.

4. A “do your own research” ethic. Maybe I’m being uncharitable here, but I think many people who are more than willing to talk about material they’ve found could still be reluctant to share the copies they’ve made themselves, especially if it took a lot of time, effort, and money to collect them. I suspect people are more willing to share when they’ve built up trust with their colleagues and when there’s some reciprocity involved. This also ties in to the point about credit and incentives.

5. Permissions/rights. In my experience researching the 19th and 20th century US, it’s pretty uncommon to come across truly unrestricted archival materials. In the days when I primarily requested photocopies, the vast majority of those copies arrived with stamps on them saying that they were for personal research use only and that further permission would be required if I wanted to use them for any other purpose. Even when taking digital photos myself, there’s usually an agreement somewhere that puts similar restrictions on those images. Furthermore, copyright in unpublished materials can be a really complicated area, especially if it’s not something you’ve been trained to navigate. The physical owner of a letter, for instance, might not have the right to publish that letter, much less grant permission to do so to someone else.

6. Lack of infrastructure. Let’s say you have material to share and you have the right to share it (or are just willing to take risks): how are you going to do that? You could e-mail a few files or send out paper copies in an envelope – if there even is a paper form to your records – but what if you have a hundred or more files/images/pages? And how are you going to handle the descriptive context and content that goes with the material? You usually need citation and location information, at the very least, if you’re going to authenticate the materials as being legitimate copies of the originals. You should have this information, if your intent was to collect things in a way that would make it possible to cite them later, but it’s still something to watch out for.

I think that last point is really key: once you’ve gotten past all of the other objections, there’s still the problem of coming up with an effective way to share material that the average historian could actually carry out without too much trouble. Not everyone has the time/background/resources to just go out and build their own digital repository/collection/archive (I’m sidestepping the terminology question here).

What are the possibilities? I can think of a few:

1. Personal networks. I guess you could call this peer-to-peer sharing, if you like putting everything into technology terms. This is basically scholars sharing material with each other at an individual level. This can be done through the mail or in person – I assume that for most of the history of history, when scholars shared material this is how they did it – or through e-mail or other file transfer methods.

Advantages: It’s pretty simple and doesn’t really require historians to do anything they don’t already know how to do, unless they’re trying some complicated file transfer method. It also happens to be a method that historians are already using.

Challenges: It’s not public, for one thing. So it’s not quite open sharing. (For some I’m sure that’s a feature, not a bug.) It also might not scale very well as the volume of material that gets transferred grows. Plus there’s a potential problem of losing track of essential metadata when sending around batches of image files: you have to be careful not to end up with directories full of filenames like DSCG1128 with no clear indication of what archives and what collections those files are supposed to be linked to. That latter issue is something everyone has to face when managing image and text collections, but coordinating among many different individuals is likely to be more difficult than coordinating among institutions or groups.
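One low-tech way to guard against the orphaned-filename problem is to send a small manifest along with the images. What follows is only a minimal sketch in Python, assuming a flat batch of photographs. The field names (repository, collection, box, folder) and the helper names are hypothetical choices for illustration, not an archival standard or anything a particular archive requires.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SourceRecord:
    """Citation context for one photographed page (hypothetical fields)."""
    filename: str     # camera filename, e.g. "DSCG1128.JPG"
    repository: str   # institution holding the original
    collection: str   # collection or record group
    box: str
    folder: str
    note: str = ""    # e.g. use restrictions stated by the archive


def write_manifest(records, out):
    """Write one CSV row per image to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(SourceRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def missing_from_manifest(filenames, records):
    """Return the image filenames that have no manifest row, sorted."""
    described = {record.filename for record in records}
    return sorted(set(filenames) - described)


if __name__ == "__main__":
    # Placeholder values only; a real batch would list actual holdings.
    batch = [SourceRecord("DSCG1128.JPG", "Example Archive", "Example Papers", "3", "12")]
    buffer = io.StringIO()
    write_manifest(batch, buffer)
    print(buffer.getvalue())
    print(missing_from_manifest(["DSCG1128.JPG", "DSCG1129.JPG"], batch))
```

The point of the second helper is simply to catch images that would otherwise travel without any context before the batch goes out. A spreadsheet kept by hand would serve the same purpose.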

2. Historian-hosted websites: Historians could set up their own websites to host the material they want to share.

Advantages: This could be open to any visitor, though of course the site owner could also employ password protection. It also would maintain the connection between the historian who collected the material and the material. If the historian were to change affiliation, as often happens in the academic world, the site could “move” with them fairly easily (in the sense of being updated to reflect the new affiliation).

Challenges: It requires historians to know how to host a site and manage an image and/or text collection, or at least to have access to someone with that knowledge. (Note: I’m not saying these are bad skills to have, just that you can’t assume many historians have them right now.) This actually might not be too difficult, depending on the platform being used. I didn’t have to know much about the internal workings of WordPress to be able to set up this blog, but finding a pre-packaged archival system that’s easy for a regular user to set up and maintain is a bit trickier. WordPress is comparatively simple.

Also, this could lead to material being distributed across dozens of personal sites, which could make it difficult to find things. As with option 1, coordinating among lots of individuals can be difficult. And what if two or more historians have materials they copied out of the same collection? Ideally, that would get linked up.

3. Institutional hosting based on the researcher’s affiliation: The researcher’s home institution supports and hosts the materials.

Advantages: As in the historian-hosted model, in this model the materials could be placed on the open web. Ideally, institutional support would mean that the institution’s archivists, librarians and IT staff would all collaborate, reducing the burden on any one individual. Institutions might be able to work the archival materials into existing infrastructure, such as a digital repository if they have one up and running.

Challenges: As mentioned above, academics often change affiliation. What happens to the material then? Does it become part of the institution’s holdings or will it be transferred? Or will one copy go with the historian and one stay with the institution? And will the new institution want to host material that’s been/being hosted elsewhere?

Another issue that could come up is the difference between records that the historian produces – such as notes, drafts, teaching materials, and other personal papers – and those that the historian collects – such as archival and other source material, much of which will be copies of materials held at other institutions. The historian’s home institution might be very interested in keeping the (or “their”) historian’s personal papers while at the same time being reluctant to keep copies of source materials taken from elsewhere.

There would also still be a need for coordination to make it possible for researchers to search across different institutions’ holdings. This is essentially the same problem the historian-hosted model would face, but at least there would be fewer institutional sites and many institutions already have a history of sharing metadata.

One additional note: the Valley of the Shadow project, which I think has been both successful and influential, might fit this model. William G. Thomas III and Edward Ayers have since moved to other institutions, but the site remains at the University of Virginia.

4. Archival institution hosting: in this model, the institution that holds the original also makes the digital copy available.

Advantages: lots of archives already have ongoing digitization projects. As holders of the originals, they are in the best position to authenticate the material they put on the web. They are also in the best position to maintain the links between individual items and their archival context – that is, where the items fit in within the larger context of the collection and perhaps of the institution as a whole. Duplication of copies among researchers shouldn’t be a problem, as the originals (or maybe we should call them “original copies”?) will be available at the archives’ site.

Challenges: Historians’ choices of what to copy and archivists’ choices of what to copy are likely to diverge quite often. Historians are probably most interested in individual items or ranges of material within collections. This can make perfect sense in the context of a research program, but to an outside observer it might look rather haphazard and partial and may not make the best focus for a digitization project. Archivists have to be concerned with their own institutional priorities and in many archives historians may not even be the primary users. That said, there are surely many opportunities to collaborate on projects and I’m sure that historians will find many archives’ own digitization projects useful for their research.

As for the kind of sharing I’ve been talking about in this post, there are some archives that employ “scan-on-demand” policies in which material is scanned as it’s requested. I don’t know how many of these scans get posted to the open web – in some cases, the scanning simply makes it less costly to produce additional copies in the future – but it could be one way to facilitate sharing among historians. I think some archives are also experimenting with programs where historians can take digital photographs in the course of their own research and then have the option of giving the archives a copy of those photos (or some subset of them) to then be put on the web. But I’m not sure if that’s actually happening, or if I’ve just read about it as a proposal.

5. Some other kind of consortial or centralized hosting. Could this be something like an arXiv for collected archival material? In theory, it would be possible to create something like that, but getting it off the ground could be difficult, as it would have to find a home somewhere. Maybe this is a possibility that the Digital Public Library of America could look into. Many public libraries have history rooms, after all.

Those are the five main models I could come up with off the top of my head. I think you can probably find actual examples of the first four, although I’m not sure I’ve come across a personal website hosting copies of archival material. My takeaways from this exercise:

  1. Outside of mailing packages from place to place, we’re really talking about digitized or digital materials here and the web remains the most open way to share them.
  2. To make this kind of sharing more open and more routine, historians need to have relatively accessible ways to transfer their material into a system for sharing.
  3. In the near future, I think we’ll see the first and third model most often. That is, historians will continue to share with colleagues and peers at an individual level, while larger-scale sharing will come mostly in the form of projects. Projects make it possible to pool resources and seem to align best with scholarly incentives.
  4. As for model 4, I think archives-driven projects will continue to be much more common within archives than historian-driven projects, for obvious reasons. However, the boundaries between models 3 and 4 are pretty artificial, as archives and research institutions already do a lot of collaboration – not to mention the fact that many universities have archives and special collections on campus. So in some ways the boundaries are an artifact of the way I’ve set up the post.
  5. There need to be ways to share and combine metadata so that people can search and browse across sites and collections. This is already true and people are working on it.
  6. There’s no escaping permissions and rights questions. Another point already in effect.
  7. A lot of what I’ve written here applies to any type of researcher who uses archives. I’ve focused on historians because that’s the context of my original conversation but I don’t mean to exclude other researcher groups from the larger discussion.
  8. There are a lot of issues I haven’t even gotten into, such as bulk access to archival material.
  9. Sometimes you just have to see the originals for yourself. Nothing wrong with that, especially if you like visiting archives and you’ve got time and support.

some thoughts on open data (1): movements and communities

Back in May Tom Slee started what turned out to be a really interesting conversation on open government data when he wrote a couple of blog posts that were fairly critical of the open data movement as seen from the perspective of progressive politics. The thread then got picked up by Crooked Timber, which ran a seminar on open data in late June/early July. The conversation pulled in a number of open data advocates and critics, and I guess it got a bit heated at times, but I found the whole thing helpful in drawing out a lot of the things people should think about when thinking about open data. If you missed the discussion and are interested in reading more, I put up a whole bunch of links in my last post.

I meant to write some comments of my own while the discussion was still going, but various things came up and I’m finally just now getting back to it. The conversation was broad enough that it would be impossible to respond to everything; besides, I’m still at a “gathering my thoughts” stage about all of this. So what I’m going to try to do in this series of posts is just pull out some of the more salient points that came up about open data, and then add a few of my own comments. I’m numbering the posts, but just for convenience – I’m coming up with the order as I go.

1. Movements and communities

One point of dispute was whether the “open data movement” can really be considered a movement – in the sense of having a unified goal or goals – or if, as Slee argues, there are just too many different groups and individuals involved, representing too many different goals and interests for it to be considered just one thing. If pressed, I’d probably shy away from using the term “movement” to describe the broader phenomenon, but for now I’m just going to sidestep the issue. Like some other commenters, I’m not sure how important it is to come to a strict conclusion on this point.

I do think that you can, arguably, identify an open data community. Now, you might object, didn’t I just substitute one seemingly-precise-but-not-strictly-defined term (“community”) for another (“movement”)? It’s a valid question. I just find the concept of a community more helpful in thinking about open data advocates. Different groups might have different ultimate goals and interests beyond their open data work, but while they’re working on open data, they generally seem to use similar language and to have similar needs. This still could prove to be a rather transient community as people move on to other things, but they’re neighbors while in town.

Or maybe that’s still too simplified a way of looking at things. Most of the discussion of open data I’ve been referring to has been about open government data: data collected, produced, and disseminated by government bodies at any jurisdictional level. But that’s just one category of data that could be made more open.

I can think of a few others: there’s the open access movement, which centers on academic research. It isn’t just about data, but it includes a call for open access to the data produced/collected as part of research activities. Some of this data comes from studies that have received government funding, but it’s not really government data in the sense of census data. A second example would be LODLAM – linked open data in libraries, archives, and museums. Again, many of these institutions are public, but this also doesn’t really fall into the government data category as conventionally understood.

While there are probably some people who are working on all of these types of data at once, it would be hard to say that there’s just one large community that includes everything under these umbrellas. So it might be more accurate to think of multiple open data communities, each with points in common with the others but working in different domains.

One final note: I’ve been reading Crooked Timber for years and I got the impression that the open data seminar didn’t generate as much activity in the comment threads as some other topics have in the past. Now, I could just be mistaken about the comment volume, but I did feel like there was a real mismatch on some posts between the level of engagement of the posters and that of the commenters. (However, a few posts got a lot of comments, like Slee’s “Seeing Like a Geek“.) This impression is another reason I think it would be possible to identify an open data community: within it, people are eager to talk, debate, share ideas, but outside of it the level of interest is apparently much lower. Or maybe this was just a result of the seminar being held during the mid-summer holiday season.

Next post [not written yet]: government information and information about government.

some readings on open data

I’ve been thinking about open data a lot this summer. Partly this is because I have a longstanding interest in transparency and government information, so I’m kind of always thinking about that sort of thing, but also because I’ve been following the wide-ranging conversation on open data that Tom Slee started back in May with a couple of blog posts: “Why the “Open Data Movement” is a Joke” and a lengthier follow-up, “Open Data Movement Redux: Tribes and Contradictions.” As you can probably guess from the titles, these posts were provocative – mostly in a good way, I thought – and generated a lot of discussion within the open data community.

Now, I should say up front that I’m going to leave my thoughts on the issues that Slee and others raised for a later post. This post is just a place for me to gather together links to what I’ve read recently on open data that I think others might be interested in. I started out intending to write just one post but the list of links to background readings got so unwieldy that I figured it would be better if I separated it out. The list is in a loose sort of chronological order, starting with Slee’s first two posts and some of the works he cites, then continuing on from there.

Disclaimer: I haven’t read everything Slee cited or every response to his posts. I’m sure I’ve left out some good stuff. Suggestions are welcome.

Starting the conversation

Tom Slee. “Why the “Open Data Movement” is a Joke“, Whimsley (1 May 2012)

____. “Open Data Movement Redux: Tribes and Contradictions“, Whimsley (8 May 2012)

Additional background (a.k.a. “footnotes I followed from Slee’s posts”)

Jo Bates. “‘This is what modern deregulation looks like’: co-optation and contestation in the shaping of the UK’s Open Government Data Initiative“, The Journal of Community Informatics (April 2012)

Michael Gurstein. “Open Data: Empowering the Empowered or Effective Data Use for Everyone?“, Gurstein’s Community Informatics, Volume 8 Number 2 (2 September 2010)

____. “Open Data (2): Effective Data Use“, Gurstein’s Community Informatics (9 September 2010)

[Gurstein later revised these posts into: "Open data: Empowering the empowered or effective data use for everyone?", First Monday, Volume 16 Number 2 (23 January 2011)]

Harlan Yu and David G. Robinson. “The New Ambiguity of ‘Open Government‘”, SSRN (28 February 2012)

Early responses

In addition to the comments on Slee’s two posts, I found the following posts particularly worth reading. Both were written in response to Slee’s first post.

David Eaves. “Open Data Movement a Joke?“, eaves.ca (2 May 2012)

Tom Lee. “Defending the Big Tent: Open Data, Inclusivity and Activism“, Sunlight Foundation Blog (2 May 2012)

Crooked Timber forum on open data

Following this first round of discussion, the conversation then shifted over to Crooked Timber, which ran a seminar on open data in late June/early July. Since Henry Farrell has already put together a page with links to all nine contributions, I’ll just link to that page instead of writing out every link. Slee wrote the lead post and the other contributors were Victoria Stodden, Steven Berlin Johnson, Matthew Yglesias, Clay Shirky, Aaron Swartz, Henry Farrell, Beth Noveck, and Tom Lee.

In addition to the seminar posts, I also recommend:

David Eaves. “Unstructured Thinking on Open Data: A response to Tom Slee“, eaves.ca (28 June 2012)

John Wonderlich. “Open Data Creates Accountability“, Sunlight Foundation Blog (6 July 2012)

As I said above, this list is not comprehensive; it’s just the subset of the larger conversation that I’ve happened to have read so far. But I think it’s a decent place to get started.

about (1)

It is difficult to write a good short bio; it can be difficult to share a good detailed one. I doubt this will turn out to be either, but I feel like I should write one anyway. To put it euphemistically, I am currently “between things”; to put it more literally, I am currently on the job market. So I can’t yet rely on a job title to carry information about myself. Instead, I’ll say something about my background and my interests.

I have graduate training and experience (and master’s degrees, even) in history, archives, and library and information studies. I also have work experience in journalism (I was an intern at Talking Points Memo) and open government/transparency (through an internship at the Sunlight Foundation). I’m interested in pretty much everything that goes with that background: history, archives, libraries, information and especially access to information, journalism, politics, government. I’d even say I’m interested in bureaucracy but I don’t want to sound boring.

In the last few years, I’ve also become really interested in computers and technology. I’m not going to chase every subject that has the word “digital” in it, but I’m certainly interested in digital preservation, digital archives, digital libraries, digital history and the digital humanities – you get the point. I’m learning to code and getting more and more comfortable with Linux (Ubuntu) and free and open source software every day. As someone who did a bit of programming in junior high and high school (Logo and Pascal, those were the days), but then spent years using computers mainly for word processing and web browsing, it’s been an interesting experience.

Anyway, this is my personal website and personal blog, and even though I could probably assign it a call number, give it a few subject headings, and place it in a taxonomy somewhere, I’m not going to classify it. The odds are pretty high that what I’ll write about will be consistent with the interests I’ve just talked about.

Since there’s not a lot of content here yet, you might want to check out these posts if you’re curious about my writing:

A post I wrote for my old blog reacting to Nicholas Carr’s original “Is Google Making Us Stupid” article. This is the only post from the old blog that I’ve copied over onto this blog. I just like it for some reason.

A “this day in history” post I wrote about the U.S. Declaration of Independence and the U.S. Constitution during World War II that appeared as a guest post at The Edge of the American West.

Something I wrote about Brandeis and the history of transparency at the end of my internship at the Sunlight Foundation a few years back.

And for the library crowd, something I wrote on my old blog about subject headings. This one might not sound the most exciting, but it was once called “high-quality library nerdery” on twitter, so there’s that.

Finally, a meta note: You may have noticed that this is an “about” post rather than an “about” page. I’m going to try a bit of an experiment. As time passes I’m likely to want to update my bio. Rather than keep changing the page, I’m going to write new “about” posts each time and then keep the old ones. This might not happen often, but it could be interesting. Four years ago I would not have even thought to mention computers and technology.

My instructor was Mr. Langley, and he taught me to sing a song. If you’d like to hear it I can sing it for you.

[Note: This post originally appeared on my old blog on June 15, 2008 and I am posting it here under that date. I didn't want to carry over that whole blog, but I did want to keep this post with me. For the record, today is actually July 24, 2012.]

————

I read this

Over the past few years I’ve had an uncomfortable sense that someone, or something, has been tinkering with my brain, remapping the neural circuitry, reprogramming the memory. My mind isn’t going—so far as I can tell—but it’s changing. I’m not thinking the way I used to think. I can feel it most strongly when I’m reading. Immersing myself in a book or a lengthy article used to be easy. My mind would get caught up in the narrative or the turns of the argument, and I’d spend hours strolling through long stretches of prose. That’s rarely the case anymore. Now my concentration often starts to drift after two or three pages. I get fidgety, lose the thread, begin looking for something else to do. I feel as if I’m always dragging my wayward brain back to the text. The deep reading that used to come naturally has become a struggle.

and thought that I’ve been having the same experience for a few years now, except that when I lose a thread while reading a book or article online and look for something else, that something else is more text in another tab or window. Then I remembered that I’ve always had to put energy into concentrating on what I’m reading, even if I find it interesting. The only exceptions are things I find engrossing – even if I don’t find them interesting. What makes something engross me? I don’t exactly know. I’d say “good writing” but that’s hardly a satisfying explanation.

I read this

Research that once required days in the stacks or periodical rooms of libraries can now be done in minutes. A few Google searches, some quick clicks on hyperlinks, and I’ve got the telltale fact or pithy quote I was after.

and thought that the seeming thinness of research aimed mainly at gathering “telltale fact”s or “pithy quote”s resides more in its goals than in its methods.

I read this

I’m not the only one. When I mention my troubles with reading to friends and acquaintances—literary types, most of them—many say they’re having similar experiences. The more they use the Web, the more they have to fight to stay focused on long pieces of writing. Some of the bloggers I follow have also begun mentioning the phenomenon. Scott Karp, who writes a blog about online media, recently confessed that he has stopped reading books altogether. “I was a lit major in college, and used to be [a] voracious book reader,” he wrote. “What happened?” He speculates on the answer: “What if I do all my reading on the web not so much because the way I read has changed, i.e. I’m just seeking convenience, but because the way I THINK has changed?”

and the sentiments felt familiar. I may have always had to work to keep focused on long writing, but I used to finish books at a much higher rate. Outside of required readings, I used to start multiple books at once until I found one that held my interest until I finished it, at which point I re-started the process. Now it seems like I’m always beginning books.

I read this

“I can’t read War and Peace anymore,” he admitted. “I’ve lost the ability to do that. Even a blog post of more than three or four paragraphs is too much to absorb. I skim it.”

and thought, who can read War and Peace in any sort of “normal” way at all? I read it in bunches over a period of about a month, quickly at first when I was into it, more slowly when I began to get frustrated with the plot about halfway through, lethargically as I approached the end, determinedly as I read the final few hundred pages in one sitting, knowing that if I put it down I was in danger of never picking it up again. I reflected that reading fiction has always been a different experience for me than reading non-fiction. I can’t skim fiction. I might read blog posts quickly, but I don’t skim them unless I’m deciding whether or not to then read them.

I read this

As part of the five-year research program, the scholars examined computer logs documenting the behavior of visitors to two popular research sites, one operated by the British Library and one by a U.K. educational consortium, that provide access to journal articles, e-books, and other sources of written information. They found that people using the sites exhibited “a form of skimming activity,” hopping from one source to another and rarely returning to any source they’d already visited. They typically read no more than one or two pages of an article or book before they would “bounce” out to another site. Sometimes they’d save a long article, but there’s no evidence that they ever went back and actually read it.

and wondered if there was also evidence that they never went back and actually read those articles. I wondered if the authors considered that people may be exhibiting “a form of skimming activity” because they were skimming to see which of their search results were useful, if any. Or because they were curious about something they found but weren’t looking for. I wondered if browsing nearby books in the stacks is “a form of skimming activity.” I wondered if this says something about how people search as well as about how people read.

I read this

“We are not only what we read,” says Maryanne Wolf, a developmental psychologist at Tufts University and the author of Proust and the Squid: The Story and Science of the Reading Brain. “We are how we read.” Wolf worries that the style of reading promoted by the Net, a style that puts “efficiency” and “immediacy” above all else, may be weakening our capacity for the kind of deep reading that emerged when an earlier technology, the printing press, made long and complex works of prose commonplace. When we read online, she says, we tend to become “mere decoders of information.” Our ability to interpret text, to make the rich mental connections that form when we read deeply and without distraction, remains largely disengaged.

and tried to remember where I saw Wolf’s work discussed recently. I resisted searching for it right then and there. [I later looked and found it: Caleb Crain's essay "Twilight of the Books" on the future of reading.]

I read this

Reading, explains Wolf, is not an instinctive skill for human beings. It’s not etched into our genes the way speech is. We have to teach our minds how to translate the symbolic characters we see into the language we understand. And the media or other technologies we use in learning and practicing the craft of reading play an important part in shaping the neural circuits inside our brains. Experiments demonstrate that readers of ideograms, such as the Chinese, develop a mental circuitry for reading that is very different from the circuitry found in those of us whose written language employs an alphabet. The variations extend across many regions of the brain, including those that govern such essential cognitive functions as memory and the interpretation of visual and auditory stimuli. We can expect as well that the circuits woven by our use of the Net will be different from those woven by our reading of books and other printed works.

and thought, that may be true, but doesn’t that mean we have the flexibility to re-wire if we change our behavior? So to the extent that there’s a change taking place, it might not be a permanent one.

I read this

Sometime in 1882, Friedrich Nietzsche bought a typewriter—a Malling-Hansen Writing Ball, to be precise. His vision was failing, and keeping his eyes focused on a page had become exhausting and painful, often bringing on crushing headaches. He had been forced to curtail his writing, and he feared that he would soon have to give it up. The typewriter rescued him, at least for a time. Once he had mastered touch-typing, he was able to write with his eyes closed, using only the tips of his fingers. Words could once again flow from his mind to the page.

and thought of Francis Parkman, who needed a special tool to help him hand-write along straight lines as his vision worsened.

I read this

In Technics and Civilization, the historian and cultural critic Lewis Mumford described how the clock “disassociated time from human events and helped create the belief in an independent world of mathematically measurable sequences.” The “abstract framework of divided time” became “the point of reference for both action and thought.”

and thought that maybe if I finish reading Mumford’s two best known Cities books, I might read some of his other work. I remembered that I decided not to get a used copy of Technics and Civilization recently because I wasn’t sure how it stood in relation to his other work – and, more importantly, because it was kind of heavy and I didn’t want to carry it when I moved.

I read this

When the Net absorbs a medium, that medium is re-created in the Net’s image.

and was reminded of Marx writing that the bourgeoisie creates the world in its own image.

I read the rest of that paragraph

It injects the medium’s content with hyperlinks, blinking ads, and other digital gewgaws, and it surrounds the content with the content of all the other media it has absorbed. A new e-mail message, for instance, may announce its arrival as we’re glancing over the latest headlines at a newspaper’s site. The result is to scatter our attention and diffuse our concentration.

and thought: you can change a lot of those settings, you know.

I read this

The Net’s influence doesn’t end at the edges of a computer screen, either. As people’s minds become attuned to the crazy quilt of Internet media, traditional media have to adapt to the audience’s new expectations. Television programs add text crawls and pop-up ads, and magazines and newspapers shorten their articles, introduce capsule summaries, and crowd their pages with easy-to-browse info-snippets. When, in March of this year, The New York Times decided to devote the second and third pages of every edition to article abstracts, its design director, Tom Bodkin, explained that the “shortcuts” would give harried readers a quick “taste” of the day’s news, sparing them the “less efficient” method of actually turning the pages and reading the articles. Old media have little choice but to play by the new-media rules.

and wondered if the author thought these were all bad developments. More ads and shorter articles certainly don’t seem like a positive step, but abstracts and snippets, done well, could be quite helpful. Assuming abstracts aren’t all that people ever read.

I read this

Taylor’s system is still very much with us; it remains the ethic of industrial manufacturing. And now, thanks to the growing power that computer engineers and software coders wield over our intellectual lives, Taylor’s ethic is beginning to govern the realm of the mind as well. The Internet is a machine designed for the efficient and automated collection, transmission, and manipulation of information, and its legions of programmers are intent on finding the “one best method”—the perfect algorithm—to carry out every mental movement of what we’ve come to describe as “knowledge work.”

Google’s headquarters, in Mountain View, California—the Googleplex—is the Internet’s high church, and the religion practiced inside its walls is Taylorism. Google, says its chief executive, Eric Schmidt, is “a company that’s founded around the science of measurement,” and it is striving to “systematize everything” it does. Drawing on the terabytes of behavioral data it collects through its search engine and other sites, it carries out thousands of experiments a day, according to the Harvard Business Review, and it uses the results to refine the algorithms that increasingly control how people find information and extract meaning from it. What Taylor did for the work of the hand, Google is doing for the work of the mind.

The company has declared that its mission is “to organize the world’s information and make it universally accessible and useful.” It seeks to develop “the perfect search engine,” which it defines as something that “understands exactly what you mean and gives you back exactly what you want.” In Google’s view, information is a kind of commodity, a utilitarian resource that can be mined and processed with industrial efficiency. The more pieces of information we can “access” and the faster we can extract their gist, the more productive we become as thinkers.

and had a few questions:

  1. What happened to labor? Is Google’s workforce organized along Taylorized lines? Reports suggest that the answer is “no,” at least for some subset of employees.
  2. How can the claim that the internet is encouraging Taylor-like efficiency be reconciled with an article premised on distraction and lack of concentration? It sounds like it is the search engine itself that’s being Taylorized.

I read this

“The ultimate search engine is something as smart as people—or smarter,” Page said in a speech a few years back. “For us, working on search is a way to work on artificial intelligence.” In a 2004 interview with Newsweek, Brin said, “Certainly if you had all the world’s information directly attached to your brain, or an artificial brain that was smarter than your brain, you’d be better off.” Last year, Page told a convention of scientists that Google is “really trying to build artificial intelligence and to do it on a large scale.”

and thought it sounded like marketing.

I read this

The idea that our minds should operate as high-speed data-processing machines is not only built into the workings of the Internet, it is the network’s reigning business model as well. The faster we surf across the Web—the more links we click and pages we view—the more opportunities Google and other companies gain to collect information about us and to feed us advertisements. Most of the proprietors of the commercial Internet have a financial stake in collecting the crumbs of data we leave behind as we flit from link to link—the more crumbs, the better. The last thing these companies want is to encourage leisurely reading or slow, concentrated thought. It’s in their economic interest to drive us to distraction.

and thought it was a good point. I wondered if it would have been better to build the article around this observation rather than around reading. Page layouts, column widths, displaying articles on one or on multiple pages, print versions, linking within the same site or set of sites – all of these things affect the way we read and are affected by the way we read (since site designers have to try to grab and hold our attention). The internet is not just some undifferentiated entity known as “the internet”; search engines don’t just pull up “the best” or “the most efficient” results at the top. There is a sense in which technology “uses” us, sure, but that shouldn’t obscure the ways technology mediates the way people interact with or act upon each other. That’s one of the reasons we use the word “media,” right? (Or is that a false etymology?)

I read this

In Plato’s Phaedrus, Socrates bemoaned the development of writing. He feared that, as people came to rely on the written word as a substitute for the knowledge they used to carry inside their heads, they would, in the words of one of the dialogue’s characters, “cease to exercise their memory and become forgetful.” And because they would be able to “receive a quantity of information without proper instruction,” they would “be thought very knowledgeable when they are for the most part quite ignorant.” They would be “filled with the conceit of wisdom instead of real wisdom.” Socrates wasn’t wrong—the new technology did often have the effects he feared—but he was shortsighted. He couldn’t foresee the many ways that writing and reading would serve to spread information, spur fresh ideas, and expand human knowledge (if not wisdom).

and thought that what Plato and the article both leave out is the unreliability of memory and the ability to check it against a documentary record (which itself isn’t always reliable).

I read this

The arrival of Gutenberg’s printing press, in the 15th century, set off another round of teeth gnashing. The Italian humanist Hieronimo Squarciafico worried that the easy availability of books would lead to intellectual laziness, making men “less studious” and weakening their minds.

and thought of Ann Blair’s article about Early Modern information overload, which you can find summarized here [dead link removed] and here [dead link removed] (the latter link points to an ungated version on this page).

I read this

The kind of deep reading that a sequence of printed pages promotes is valuable not just for the knowledge we acquire from the author’s words but for the intellectual vibrations those words set off within our own minds. In the quiet spaces opened up by the sustained, undistracted reading of a book, or by any other act of contemplation, for that matter, we make our own associations, draw our own inferences and analogies, foster our own ideas. Deep reading, as Maryanne Wolf argues, is indistinguishable from deep thinking.

and was reminded, despite my skepticism about much of the article, how much I too value sustained reading. But having read some books online in the past two years, I don’t know that it has to be print-based.

I finished reading the article. I tracked down some links and planned to post on it in a day or two. I read some other things online. I turned off the computer and began reading this book, which I’ve been meaning to read since I mentioned it months ago. It could be years before I finish it.

[The original post can still be viewed here.]