mBsLOG

    Welcome to my weblog. It is an unconventional blog in that I am not planning to post daily or weekly, but only as topics of interest emerge. I enjoyed playing a little with my initials and the word blog and am amused by the fact that it is as much something I am slogging through as something I am blogging about. This listing only shows the five most recent posts.

    • Here is an index of all the topics with direct links to the posts.
    • Here are the posts from 2007.
    • Here are the posts from 2008.
    • Here are the posts from 2009.
    • Here are the posts from 2010.

    I will try to discipline myself to keep a more or less regular set of reflections coming, but I can't promise. I have disabled commenting and discussion as it ended up being more maintenance and cleanup than I cared to deal with. That doesn't mean your comments and thoughts aren't welcome. Should you wish to comment on what I have said, I will be happy to add your comments verbatim so long as they are not spam. Simply send an email to me at Pitt -- see my home page. I will insert it in the appropriate post with attribution if you wish. Please reference the title and date of the post on which you are commenting. Also, if you want to suggest a topic that might be covered or discussed, let me know and I will try to include it.

    Here is access to my mBsLOG as an RSS feed.


    Sun, 20 Dec 2009

    Academic Life in the Green Zone (December 20, 2009)

    When an organization is concerned about the security of its digital assets, it creates a firewall to protect them. It is a rather poor analogy to a building or car firewall, but it will do. In theory the private network, devices, and data are safe inside the firewall. Again, also as a simplification, the firewall is generally considered as a system that prevents outsiders from coming in but allows insiders to venture out. (Some organizations also restrict outgoing traffic, e.g. to pornographic sites.) When some assets of the protected organization need to have exposure to the outside world, such as a website we want people to see, we place that asset in a “DMZ” or demilitarized zone. We feed it from the protected zone and take great care to protect it and still greater care to make sure that any damage that might be done to it is contained and easily repairable. This is a rather simplistic, but complete picture of many organizations today. Our firewall surrounds us and protects us from the “hostile environment” that the internet has become for many organizations. The exposed part of our digital presence is placed in a heavily guarded demilitarized zone. We live in a heavily guarded Green Zone. (On days when our protectors seem both overly aggressive and lacking in common sense, I am more likely to think of the Green Zone as the Militarized Zone.)

    There is a sense of déjà vu in this situation, though the parallels are not quite perfect. Prior to 1980, computing devices were expensive and shared. Specialized staff maintained them in secure zones. When I attended NYU in the late sixties, “regular” computer science students were not even allowed on the floor of the Courant Institute that housed the mainframe. Acolytes carried our trays of cards up to the second floor where Deacons put them in the job queue. The results of our efforts were returned as a computer printout the next day. What could be done with the time-shared computers of the 1960s and 1970s was a matter of negotiation with the gods of the computer center. They chose the hardware, operating system, software, etc. They allocated precious disk space and job priorities. The resources allocated to you were allocated after consideration of the needs of the organization as a whole. Violate the code of conduct and you could be shut down.

    In the 80’s, with the advent of PCs, inexpensive minicomputers, and specialized workstations, the world began to change. In academia, the development of networks and networks of networks – the internet – began to change the very fabric of computing. First, each of us controlled and determined appropriate use for our own machines. Second, devices historically dedicated to computation began to be used for communication. With the introduction of the protocols for simple distributed information sharing – http – the structure we refer to as the World Wide Web emerged to challenge our entire vision of information processing. The computational fabric was moving from the world of data to the world of information – and would eventually begin to make inroads into the world of knowledge. The creation, dissemination, and use of information began to change. A new universe was here to replace the Gutenberg galaxy.

    But something else was changing as well. With the popularization of the internet, email, and the web, business realized that a part of its business was also related to information. With digital music and photographs, high speed connections to the home, and businesses seeing a new marketing channel, it was not long before the information highway was also seen as a financial marketplace. When the money started to flow, criminals moved in. And we found the need to create safe walled communities – green zones. With that overly long, but hopefully reasonably accurate description of 50 years of computing, we can turn to asking how academic life is impacted in the green zone.

    Academic life involves three kinds of activities: instruction, research, and service. I reflect here on the first two, although the third is undoubtedly impacted as well. I happen to teach things like “web standards and technology”, “e-business”, “e-business security”, and “web services and distributed computing”. My research is focused on social networking systems and web-delivered medical interventions.

    In teaching, I need to help students to develop skills related to building distributed systems that link organizations across the internet. For example:

    • Students I teach write programs more than they write papers. They write programs in Java, JavaScript, Perl, C, etc. and mail them to me to be graded. Those of you who have been subject to email viruses know that one way hackers attack machines is by sending mail attachments that are malicious programs. So what do our protectors do? They delete anything that might be malicious content from incoming emails, and for most users they do an amazing job of eliminating only the malicious stuff. Guess what happens to student homework? We find more and more clever ways to work around the restrictions, and our protectors find better and better ways to protect us. Many of my colleagues use gmail to avoid these issues. Every time I hear this, I cringe as I think about the confidentiality of faculty-student communication when we are forced to move it into a public space because our private space is protecting us too much.
    • My students should gain expertise in the development of programmatic elements of web services. Let me put it more simply. They need to learn how to process a shopping cart to complete an online order, or to build an online help desk with chat facilities. This is done by writing programs that control the web interaction. Each student at the university is given facilities, controlled in our green zone, to build websites and see how they work. But because our institution is legitimately concerned about abuses, those websites are limited to the most simple forms – without programmatic capabilities. That is to say, students are allowed to put up static web pages such as was reasonable 15 years ago, but which are horribly antiquated by today’s standards. I am allowed to run my own servers, but their existence is frowned upon and viewed as a liability.
    • A business wants to know how its website is being used by clients. The design of a website should reflect a detailed understanding of how it is used. Fortunately, the design of web servers is such that they can be set up to keep detailed logs of the interactions that take place. These logs can be analyzed to see who is visiting us, what they are looking for, where they get lost, how they get from place to place, etc. It is useful to provide these detailed logs to students and to have them conduct analyses and make recommendations for improvement. Businesses do this on a regular basis. Unfortunately, the logs kept by the university on its website have been defined as private information and are sealed. Students cannot look at them and assess them. (Indeed, I am not sure that the University officials who should be concerned about how the website is used bother to ask for this data for analysis.)
    • One of the hottest topics today is social networking sites – like LinkedIn, Facebook, and Flickr. These sites allow people to volunteer information for public exposure. Our students use these sites and our students, at least some of them, will be asked to build these sites. Perhaps the most exciting academic challenge in the design of social networking systems is the back end analysis of the data. For example, we can examine statistical correlations among users of the system, the contributions they make, and the patterns of their behavior. How do we conduct that kind of analysis? We mount a public site that provides some desirable service – such as recommendations for restaurants or events. We then allow people to use it and watch what they do. This presents a risk, and we have all read about inappropriate behavior on these sites. For this reason, the University frowns upon the development of social networking sites that are publicly accessible. Liability understood, how do we teach the next generation of developers to build better sites? We could do it in a totally controlled environment where only our students could access the site, but making it accessible to 20 students doesn’t teach us how to control it when it is being accessed by millions of people.
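
    As a rough illustration of the kind of log analysis mentioned in the third example above, here is a minimal Python sketch that counts page requests, "not found" responses, and distinct visitors. It assumes the server writes the widely used Common Log Format; the function name, the sample addresses, and the page names are all invented for illustration and do not come from any real log.

```python
import re
from collections import Counter

# One line of Common Log Format (an assumption about the server setup):
# host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def summarize(lines):
    """Return (hits per page, 'not found' counts per page, distinct visitors)."""
    hits = Counter()
    missing = Counter()
    visitors = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        visitors.add(match["host"])
        if match["status"] == "404":
            missing[match["path"]] += 1  # a place where a visitor got lost
        else:
            hits[match["path"]] += 1
    return hits, missing, len(visitors)

# Invented sample data, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [20/Dec/2009:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.1 - - [20/Dec/2009:10:00:09 -0500] "GET /courses.html HTTP/1.0" 200 2310',
    '192.0.2.7 - - [20/Dec/2009:10:01:30 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.7 - - [20/Dec/2009:10:01:42 -0500] "GET /old-page.html HTTP/1.0" 404 209',
]

if __name__ == "__main__":
    hits, missing, visitor_count = summarize(SAMPLE)
    print("Most requested pages:", hits.most_common(3))
    print("Pages visitors could not find:", dict(missing))
    print("Distinct visitors:", visitor_count)
```

    Even a summary this simple shows which pages draw traffic and where visitors hit dead ends, which is the starting point for the recommendations students would be asked to make.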

    Research can also suffer in the green zone. The issues are a little different and a few examples will serve to illuminate the scope of the problem.

    • Consider the development of a website that is intended for research on whether certain kinds of psychoeducational intervention will reduce symptoms and improve quality of life for persons with severe mental illness. The university and the researcher would like the website to be secure and the data collected to be protected. This is reasonable, and required by IRB regulations and HIPAA law. To ensure my server is monitored, the University would be happiest if it was built on their protected platforms, but the kinds of programmatic tools that support the website are not supported by the University system. So, we can develop the website if they don’t protect us, but that is not good. They can protect us if we don’t build the website, but that is a non-starter.
    • By the way, how will we measure the use of a website by the subject and correlate it with the change in symptoms or quality of life? The answer is that we will register the users and track their use of the website. Remember those logs I talked about earlier. It is possible to add information about the subjects to the logs, which allows us to track who read which page or read which bulletin board posting. There are a couple of problems here. The University does a wonderful job of controlling access to protected websites for its members, but they have no category for subject in a research study – they appear and disappear too fast. In addition, if they kept this information in logs, which I don’t believe they do, they would not allow us access to them. Thus, we would not be able to do the hours of analysis that we now do to show that the web based intervention was directly related to changes in symptoms.
    • Research is not done in isolation. We have collaborators with whom we work at other institutions. It is not uncommon to want to provide access to data between institutions in an automated fashion. That is, machine A makes a connection to machine B and checks to see if new data has been collected by B. If so, it is transferred from B to A. This is done securely by what is called a “cron job” or “scheduled task” operating over an SSL (Secure Sockets Layer) connection using public/private key authentication. Trust me, it makes good sense and happens all the time. But we have disabled these functions for the users of the University system because they present a possible security loophole. We find ourselves in between another rock and a hard place.
    • One final example. For a variety of reasons, like HIPAA compliance and IRB requirements, we can’t send information about subject activity to researchers via email. (Actually, we could, but getting researchers to use secure email is very difficult.) As an alternative, we can ask the researchers to check their secure website every ten minutes to see if anything has happened that requires their attention, but that doesn’t work either. One solution we have found is that when something happens on the website we can send a very simple email to the researcher, unsecured, that says their attention is needed. No private information. Just a note to say they are needed. They can then securely connect to the site and take care of business – an optimal use of everyone’s time. Recently, the University improved the security and quality of the email services by requiring the user to log in to send mail. This protects us from programs that spew unauthorized spam mail. Unfortunately, it also blocks our legitimate notifications.
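
    A minimal Python sketch of the content-free notification described in the last example above might look like the following. The function names, the addresses, and the site URL are hypothetical. The second function shows only one possible way a program could log in before sending, assuming the mail server offers STARTTLS and password login; nothing here says which mechanism the University's mail service actually uses.

```python
import smtplib
from email.message import EmailMessage

def build_notification(researcher_address, sender_address, site_url):
    """Build a notice that carries no information about any subject."""
    message = EmailMessage()
    message["From"] = sender_address
    message["To"] = researcher_address
    message["Subject"] = "Your attention is needed"
    message.set_content(
        "Something on the study website requires your attention.\n"
        f"Please log in securely at {site_url} to review it.\n"
    )
    return message

def send_notification(message, host, port, username, password):
    """Send through a mail server that requires the sender to log in.

    Assumes the server supports STARTTLS and password authentication.
    """
    with smtplib.SMTP(host, port) as server:
        server.starttls()  # encrypt the session before logging in
        server.login(username, password)
        server.send_message(message)

if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    note = build_notification(
        "researcher@example.org",
        "study-server@example.org",
        "https://study.example.org/",
    )
    print(note)
```

    The message body is fixed text plus the site address, so nothing about a subject can leak even though the mail itself travels unsecured.
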
    Having drawn a picture of some of the drawbacks of academic life in a green zone, let me make clear that I like life in a green zone. It makes many of the things I want to do easier. It provides me with 24x7 monitors for my systems, assures better power and internet connections, and gives me reliable automatic backup. Further, when we encounter most of the problems mentioned above, it is possible to get them corrected by reaching senior technical people. For example, when I raised the issue of automated email notifications, I was told that I could have my server “white listed”, i.e. by providing the address of the machine that would be sending notifications in advance, the controlling personnel would make sure the mail was still accepted.

    The issue, and the point of this blog entry, is that just as we had to develop a better form of communication between users and the “high priests” of centralized systems in the 70’s, so today we need to provide better communications between the denizens of the academic community and the technical staff patrolling and protecting the green zone. The academic life is about exploring and developing the new world and sometimes this exploration can be stymied when those protecting us don’t understand what we are trying to do.

    [/2009/12] permanent link


    Thu, 24 Sep 2009

    The Need for New Standards (September 24, 2009)

    Recently, I was asked about the need for more standards. The questioner indicated that: "A study by RAND identifies multiple levels of interoperability that are necessary to run the Internet. The study states that, by 2000, fights on communication syntax (e.g., TCP/IP) have been settled, and there appear to be reasonably good solutions to the problem of knowledge syntax, and both solutions are based on open standards. RAND then predicts that semantics will be the next battleground, followed most likely by 'future fights at the service level.'"

    Once again, and for seemingly the 10th time in the last few months, I have been asked to revisit a research topic that has been on my backburner for almost a decade. Through the 1980's and 1990's information technology standards and standardization were an active area of research. Over the last decade, research has declined in my "technology" circles, but it has increased in public policy and business circles, particularly in Europe and the Far East. The question led to the following thought about standards and standardization, particularly related to information.

    Standards and standardization are dynamic. Both the goal of standardization and the process of achieving the goal have evolved over time. The first thing to understand is that standards can be very different kinds of entities. Scholars suggest that standards fall into three broad categories -- measurement, quality, and compatibility. How bright is a light bulb or how intense is an electromagnetic field -- these standards address how we measure such. From a quality perspective, we might want to know how good a motor oil is at a given temperature, or how pure some drug or food is. Finally, compatibility standards address building a plug that fits in an outlet or a mail note that can be read by a mail program. Below I suggest that there are three discernibly different standardization eras.

    The early history of standards setting had a lot to do with professional responsibility, international commerce, and business issues. These early "industrial age" standards were set slowly and carefully, often after a clear need for a standard had emerged. Some of the earliest standards were also demanded by a concern for public safety. Events like the great Baltimore fire in the US led to the elevation of standards for fighting fires. (Despite a "national" response to a fire in Baltimore, the city burned to the ground because the fire companies that responded from as far away as New York City could not connect to hydrants because of non-standard fittings.) Similarly, the oldest US standards organization (ASME) took responsibility for setting standards that helped overcome a growing number of boiler explosions. I refer to these as classic professional responsibility standards. Similarly, related to international commerce, there are two classic early standards setting efforts -- the measurement of electricity (what is an amp, volt, etc.) and tariffs related to international messages -- how do the fees associated with sending an international wire or making an international phone call get divided. Finally, there were early standards set to accommodate business needs -- my personal favorite was standard time required for operation of the railroads.

    As time went on, two phenomena occurred. First was the dawn of the information age. As telecommunications, computation, and networking grew to a major force in our world, so did the need to standardize the related operations. We sometimes forget that there were separate mail networks, dozens of competing operating systems, hundreds of different document standards, very distinct networking standards, and even different encoding standards -- i.e. EBCDIC versus ASCII. While manufacturing era standards setting continued, information technology related standards became a critical and rapidly growing area of standardization. I have long held, but never done the grunt work to prove, that IT standards in the period 1980-2010 have outweighed all other standards in terms of the pages produced, the cost to develop, etc. As this occurred, several corollaries emerged -- anticipatory standards versus market dominance standards, consortia versus SDOs, business interests versus engineering interests, etc. As if the need for faster, more complex, more interrelated standards were not enough, the growth of free trade agreements catapulted standards into roles that were minor historically, but became significant in a world where tariffs were frowned upon. All of these events encouraged an evolution from standards as technical solutions to standards as business instruments. (While I think this situation has resulted in standards that are less efficient public goods, I don't know that this is a bad thing. It is simply a reality.)

    Most recently, with the emergence of a true functioning global economy, and the nearing of the asymptote of instantaneous communication via email and the world wide web, the era of global instantaneous information transfer has become a reality. What became apparent when this occurred is that we did not have global, or even national, or even industry wide semantics to support that communication. In the "old days" (1960-1990), we had struggled with international standards for commerce -- EDIFACT (Electronic Data Interchange for Administration, Commerce, and Transport), directory services (X.500), Abstract Data Types (ASN.1 and BER). All of that disappeared in the heady simple globalization made possible by the WWW. Now, ten years later, we find ourselves lacking common shared semantics for everything from shoe sizes to privacy.

    The best model for how this shared semantics is developed is put forward in dribs and drabs by Michael Dertouzos and Tim Berners-Lee. Basically, what they say is that it will happen in a decentralized way and will grow as the need for it becomes apparent. Sometimes the semantics will be developed a priori and sometimes a posteriori. That is, sometimes a group will agree in advance to develop a common shoe size. Sometimes, we will agree not to and someone later will develop a translator. In this sense, a shared ontology or shared semantics will emerge as needed.

    The RAND report makes a reasonable case. I am not sure the labels "syntax", "semantics", and "service" help, but the basic premise has some validity. I wonder if we might not be better off in defining different categories of standards that support global commerce. Personally, I find it a little confusing when we equate something like the standard for addressing with a shared semantics for what constitutes a purchase order.

    [/2009/9] permanent link


    Thu, 23 Jul 2009

    Links (July 23, 2009)

    Over the last decade, I have worked with Armando Rotondi on a number of projects that explore the use of web technologies for various forms of medical interventions. Several of the studies are related to individuals with schizophrenia, and explore the delivery of Family Psycho Educational Therapy to individuals with schizophrenia. As a part of that study we designed a website, based on extensive usability studies with the subjects, that would be easy for them to use. We had done exhaustive literature reviews and found no clear guidelines for web accessibility for individuals with Severe Mental Illness (SMI). SMI would include not only schizophrenia, but traumatic brain injury, Alzheimer’s disease, attention deficit hyperactivity disorder, and other conditions. We are currently working on a project for the Veterans Administration in which we will do a controlled study of the usability of more than 4000 controlled website designs that will be varied across twelve dimensions. We are examining things such as reading level, number of links, link construction, page length, etc. It is pretty easy to begin to develop a list of design guidelines if you consider the following simplified summary of the ideas that are generally supported in the literature. Individuals with SMI may:
    • have difficulty with abstract reasoning
    • be easily distracted
    • have a low level of reading comprehension
    • be easily confused
    • have difficulty translating specifics into generalities
    Having had good feedback on the website we designed to be as flat as possible using very explicit links, we are now hoping to be able to experimentally define concrete metrics that can be used by designers to build websites that are more accessible to individuals with SMI. As one might guess, several of the working hypotheses have to do with links. Given the earlier work where we had developed a system that used links that were very explicit, and were generally longer, we decided that one condition in our current study would be the number of words in links – links of length 3 or less versus links of length 7 or more. This gave us a very concrete metric that could easily be communicated to designers if validated by the study. As we began to write the code that would generate the sites by adding and subtracting words, we noticed that something was wrong. The original site we had developed had many links such as:
    • How to talk to your doctor about medication side effects
    • What are the side effects of my medications?
    • How can the different mental health professionals help?
    • Who treats schizophrenia in my area?
    But we also had links such as:
    • What is schizophrenia?
    • What causes schizophrenia?
    • Can anyone get schizophrenia?
    In designing the test sites, we set as one condition links of 3 words or less versus links with seven words or more. We realized that some of the links that we made longer were no longer as clear as they had been. Going back to think about how we had abstracted the dimensions to be tested, we realized that the links we had constructed were concrete and explicit. In general, that made them longer, but sometimes a very concrete explicit link was actually very short. For example, “What is Schizophrenia?” was concrete, explicit and short. The dilemma for developing guidelines is that “explicit and concrete” is harder to quantify than links of seven or more words.

    In addition, as we examined our generated pages we realized that we had not lengthened the links in a constant navigation bar we placed at the top of every page. The constant navigation toolbar is used in all the sites we have designed for various medical research projects. The sites have between five and seven items in this bar such as “Home”, “Library”, “Journal”, “Chat”. They contain what I think of as a safety net. (You can always get home.) In general, we try to include the pages that an individual would want to jump to from anywhere. Other researchers refer to these pages as landmarks. In general, the labels associated with these links are one or two words. They also contain the most important hub pages. For example, a lot of our work has to do with online support groups where users spend as much as 90% of their time. For these sites “Support Groups” is an element in the constant navigational toolbar.

    This leads to a more precise operational statement about the links designed for individuals with SMI. First, links that are designed to assist a user in answering a question or obtaining specific information should be as long as they need to be to explicitly articulate the content that will be found if the link is traversed. They should be constructed such that they are simple sentences if possible. Any words or phrases in the statement that are extraneous or confusing should be removed.

    Second, the site should provide a direct access/recovery mechanism that is constant across all pages on the site. The goal of this navigational feature is not discovery but recovery. The goal of the navigation bar is to provide direct access or recovery of a known state (e.g. the home page). It provides the ability to simply and quickly hyperjump to a location of regular and known value. The most obvious of these is the “Home” menu item. Others might include things such as “Help” or “Contact Us”. Direct access buttons would include items that a user might want to get to from any place on the site. Again, these will depend on the nature of the site and the purpose for which people use it. These links should be as few as possible, as clear as possible and as simple as possible. There may be as few as one or two and as many as six or seven.
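
    To make the word-count condition discussed above concrete, here is a small Python sketch that sorts link labels into the "3 words or less" and "seven words or more" groups. The function names and the "neither" label are hypothetical, and this is not the code used to generate the study sites; the sample labels are the links quoted earlier in this post.

```python
def word_count(link_text):
    """Count whitespace-separated words in a link label."""
    return len(link_text.split())

def length_condition(link_text, short_max=3, long_min=7):
    """Place a link in the 'short' or 'long' study condition, or neither."""
    n = word_count(link_text)
    if n <= short_max:
        return "short"
    if n >= long_min:
        return "long"
    return "neither"

# Link labels quoted earlier in the post, plus one navigation bar item.
LINKS = [
    "How to talk to your doctor about medication side effects",
    "What are the side effects of my medications?",
    "Who treats schizophrenia in my area?",
    "What is schizophrenia?",
    "Home",
]

if __name__ == "__main__":
    for link in LINKS:
        print(f"{word_count(link):2d} words  {length_condition(link):8s} {link}")
```

    Note what the count cannot see: “What is schizophrenia?” and “Home” land in the same short group, although one is an explicit question about content and the other is a recovery link. That is the gap between "number of words" and "explicit and concrete" described above.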

    [/2009/7] permanent link


    Thu, 18 Jun 2009

    The Federation and Balkanization of Information (June 18, 2009)

    There is little doubt that we are learning new ways to use information to transform economic and social enterprises.  In e-business, one of the key concepts I teach is the notion of replacing inventory with information.  The lecture is long and detailed, but let it suffice here to suggest that it is pretty easy to see that inventory represents an investment of money and that it costs money (storage space, pilferage, etc.) to store it.  If we have perfect information about our needs, we can manage inventory on demand.  Thus, we replace expensive inventory with cheap information.  In similar ways, the use of the internet is having a dramatic impact on our social system, from politics to health care.  In the midst of all this, we all have a clear sense of what information is, but good scientific definitions elude us.  I believe it is important that we have a better sense of how to measure information, to determine its worth, to understand how it flows and is transformed, how it is aggregated and balkanized, etc.  Information balkanization is only one manifestation of the effort to control and manage information.  In a sense balkanization of information has served to some extent to protect information about me, but there are signs of information federation that allows partners to share information.  I fear that costly balkanization will soon give way to revenue generating federation.

    Definitions of Information

    Definitionally, there are a variety of ways to establish information metrics.  It is theoretically appealing to fall back to Shannon’s definition of information as a measure of the entropy in a signal.  Unfortunately, this definition does little to inform economic or social policy.  While information can be measured objectively in terms of entropy, it is the impact it has on people and systems that may be more critical.  The common sense social definition – i.e. information is that which I don’t know already – is interesting in that it makes all measures of information relative.  (I suspect you already knew that.)  We might try to be more formal and say that data that causes a change in the state of the receiving system is information.  We might try to build a more complete model and suggest that information encoded into a system constitutes knowledge.  This approach is appealing in that it links the definitions of data, information, and knowledge.  It is unappealing in that it lacks a quantitative metric, and even if one existed, it would need to be heavily influenced by the individual receiving the data.  (What is information to you may not be information to me.  At the same time, we might both share the same knowledge.)
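
    For readers who have not seen Shannon’s measure, here is a minimal Python sketch. It treats each character of a message as a symbol and computes the average information per symbol in bits, H = sum of p * log2(1/p) over the symbol frequencies p. Treating single characters as the symbols is a simplifying assumption made only for this illustration, and the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(message):
    """Average information per symbol, in bits: H = sum(p * log2(1 / p))."""
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

if __name__ == "__main__":
    for text in ["aaaa", "abab", "abcd", "listen", "silent"]:
        print(f"{text!r}: {shannon_entropy(text):.3f} bits per symbol")
```

    Because the measure depends only on how often each symbol occurs, “listen” and “silent” score the same. It is blind to what a message means to the person receiving it, which is why it does little to inform economic or social policy.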

    Public versus private information

    One of the interesting problems we face today is the aggregation of information about us on the web.  Much of this information is balkanized,  and some of it is more federated than we know.  What is more interesting is that the information about us includes both public information about us that we share (a photograph) as well as private information we may volunteer in the anticipation that it will be kept confidential (a credit card number or our birthdate).  But there is also information about us that can be gathered from our clicks that reveals things about us that we might not know.  This brings to mind the “Johari Window” proposed by Joseph Luft and Harry Ingham in the 1950s as a conceptual model which was used in counseling and self help groups.  They suggest, in a narrow context, that there are four categories of information about an individual:
    Johari Window          Known to self
                           yes        no
    Known to others  yes   open       blind
                     no    hidden     unknown
    It is surely the case that all four types of information exist on the web.  We have discussed three of them already.  The fourth, unknown, is simply waiting for the right data mining technique.

    Ownership and shared information

    This leads one to interesting notions of the ownership of information and the location of information.  Perhaps the most interesting information in this category is medical information.  Consider an individual’s medical record.  Who owns this record?  Is it the physician who created it?  Is it the patient it is about?  The matter may be further complicated by considering the components of a medical record.  Who owns the following:
    • An x-ray of my lung.
    • A reading of the x-ray
    • A diagnosis based on the reading
    I would like to believe that my medical record, my employee record, my education record are all my property, but it is not clear that is the case.  I don’t believe any mail system suggests that the author owns the content.  I know at my University it is the University, not me, who legally owns it.  The challenges to the perception that it is my personal information are few and far between and generally meet acceptable social criteria for the intrusion, but they are there.

    “Order” of information

    Some information is primary – for example, one might consider the statement that it rained in Pittsburgh primary.  This could be called first order information.  Collecting these pieces of information for a year, one might derive a piece of second order information such as 2003 saw rain on 30% of the days in Pittsburgh.  Collecting this information for several years, one could derive a piece of third order information – over 20 years, the average number of rainy days in Pittsburgh is decreasing.   Information might also be categorized in terms of how it is derived, and this might impact other properties – such as ownership.  For example, the fact that a person has a certain height or weight might be defined as first order information.  The representation of that information as a certain number of inches or a certain number of pounds might be declared second order information.  The fact that a person is “overweight” is obtained by a function of height and weight in accord with some algorithm.  Is the information that a person is “overweight” their information or does it belong to the person who applied the algorithm?
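
    The rain example can be sketched in a few lines of Python.  The daily observations below are invented purely for illustration (ten days per year rather than 365), and the least-squares slope is just one assumed way of summarizing a trend.

```python
def fraction_rainy(daily_observations):
    """Second order: share of days with rain, derived from first-order daily facts."""
    return sum(daily_observations) / len(daily_observations)

def trend(yearly_fractions):
    """Third order: least-squares slope of the yearly fractions across years."""
    n = len(yearly_fractions)
    mean_x = (n - 1) / 2
    mean_y = sum(yearly_fractions) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(yearly_fractions))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hypothetical first-order data: True means "it rained that day".
years = {
    2001: [True] * 4 + [False] * 6,
    2002: [True] * 3 + [False] * 7,
    2003: [True] * 2 + [False] * 8,
}
yearly = [fraction_rainy(days) for _, days in sorted(years.items())]
slope = trend(yearly)  # negative when the share of rainy days is falling
```

    Each level is a function of the level below it, which is what makes the ownership question awkward: the person who wrote `trend` contributed something the daily observer did not.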

    High grade versus low grade

    Some information that describes me explicitly – my height, my medical condition – might be considered high-grade information about me.  Low grade information might include:
    • Which books I look at on amazon.com
    • When I am logged onto the internet
    • What Microsoft applications I use, or what I ask for help about
    Do I have the same right of ownership of high and low grade information?  Indeed, how is the ownership established?  When low grade information is anonymous and aggregated, who owns it?  Generally, I do not control this kind of information about me.  When is information explicitly or implicitly transferred?  As with the shared ownership issue, this category again asks who actually owns what portion of the information.

    Operations on Information

    We lack formalizations for operations on information.  While we know how to copy artifacts that contain information, or delete them, it is not clear how we determine whether two pieces of information are identical.  How do we “subtract” or “add” two pieces of information?  How do we measure the change in information after transformation?  How do we transform information from first to second order?

    From an end-user perspective, the operational problems with information include such things as understanding the ownership of information, dissemination of information about individuals, tracking of information flows, etc.  A variety of disciplines may contribute to more operational and rigorous definitions of information. 

    We might be concerned about keeping private information private – how do I secure information about me?  How do I control information I divulge to others?  How do I ensure it is not reused without my permission?  Is there a difference between selling my email address to others, telling others I am looking for a mortgage, or telling others I work at a University?

    What is the role of public-private key encryption in controlling access to “jointly-owned” information?  For example, a physician can keep information about me, but it may only be released when accompanied by both my and the physician’s private keys.  How does this impact medical research?  How does this impact emergency care?  What happens to my information when I cease to exist?  How does my privacy impact the public right to be safe?

    Can information be doped?  Explosives are tagged or doped with trace chemicals that allow the origin of the explosive to be traced in the event that it is used in a crime.  Like some compression, the doping process is asymmetric.  It is low cost to dope an explosive and high cost to trace the tags.  Like compression, the basic idea is to keep the costs of the most frequent operation low and allow the costs of the less frequent operation to grow.  Thus, when a file is infrequently compressed and very frequently decompressed, the cost of the compression can be high while the cost of decompression is kept low.  How might doping or watermarking be used to trace illicit use or dissemination of information?

    [/2009/6] permanent link


    Tue, 24 Feb 2009

    Computational Thinking (February 24, 2009)

    In the February 2009 Communications of the ACM, George H.L. Fletcher and James J. Lu make a compelling argument for introducing Computational Thinking in the K-12 Experience (pp 23-25, Communications of the ACM, 52(2).) They reference Jeannette Wing’s call for Computational Thinking put forward in the Communications of the ACM (49(3), March 2006, 33-35.) Both of these pieces provide compelling and well reasoned arguments of which I am supportive. I would like to take a few moments to articulate what I envision as a complementary discussion of this topic.

    As readers of this blog will know, I have a pet theory about the emergence of a new form of communication based on the use of digital technology – which I call immediacy. (see http://www2.sis.pitt.edu/~spring/mBsLOG/blosxom.cgi/2007/09/30#immediacy) The theory is far from my own. I like to think it has some twists that I have contributed, but it builds upon the brilliant insights of Walter Ong put forward in Orality and Literacy: The Technologizing of the Word (Walter J. Ong, Orality and Literacy: The Technologizing of the Word. New Accents. Ed. Terence Hawkes. New York: Methuen, 1988). At a more conversational level, I found Denning and Metcalfe’s Beyond Calculation to be an insightful collection of essays describing the future of computing – with what is beyond calculation being communications (P. J. Denning and B. Metcalfe, Eds., Beyond Calculation. New York: Springer-Verlag, 1997.) I would like to put forward two propositions on computational thinking, one related to communications and one related to information, a concept which I have addressed elsewhere (see http://www2.sis.pitt.edu/~spring/mBsLOG/blosxom.cgi/2008/09/25#Information)

    Related to communication, I firmly believe that a new and rich form of communication complementary to oral and literary communication will emerge over the coming centuries. It will not supplant orality – we will still tell stories in the epic traditions. It will be a sad day if we come to a time when our experience and the telling of it is not rich with exaggeration, abstraction, repurposing and other aspects of the oral tradition. At the same time, the educational system will need to develop techniques to help students come to grips with the awesome power of immediacy in the same way it helps us to become conversant with literacy. This will involve education in literacy both in terms of mechanics – like writing and composition, and examples – great literature. I have a suspicion that the Beowulf of the era of immediacy may well be Gordon Bell’s MyLifeBits project. (see http://research.microsoft.com/en-us/projects/mylifebits/) It is a great and concerted effort to develop a monumental collection that will tell the story of one man’s life and efforts to communicate it.

    Somewhere in the future, a child will be born whose whole life experience will be captured and digitized so as to form an intimate memory supplement far beyond anything that Vannevar Bush might have imagined for his elite scientists. As time goes on, children will master techniques for capturing, organizing, mining, and sharing that personal experience. Some will try and fail and their communications via the integrated technology will be noisy, uninformative, and boring. Others will achieve a simplicity, clarity, elegance and intensity that will allow the receiver to experience the world and its structure in a way they might never have been able to before. I might imagine that a future reader of this piece might exclaim that this presentation is a horrible waste of the power of the communication medium with its stodgy use of the literary form in this much more expressive medium. All of this is to suggest that while I concur with Fletcher, Lu, and Wing that we should be considering how computational thinking might be integrated into the curriculum, we also need to be thinking about how computational communication should be woven into the curriculum.

    Related to information, I would like to look at the issue of computational thinking from the perspective of information. A colleague of mine, Kai Olsen, has been a great proponent of the thesis that we can use the computer only in those areas in which we have formalized knowledge. I suspect that Kai would soften his hypothesis a little after all these years, but the basic tenet would still hold. If we don’t understand something, it is difficult to instruct a computer on how to handle it. The development of the social aspects of the web and the emergence of what some call collective intelligence suggest that first order knowledge may be derived using formalisms applied to secondary phenomena. Let me be more explicit. When the PageRank algorithm is used, we can identify a page that is likely to be relevant to a query, not based on the query, the person executing the query, or the resource identified, but based on the number of “important” links to that resource, where the importance of the links is recursively defined for all of the sources from which those links emanate. Similarly, algorithmic processing of tags, or annotations as I view them more generically, may provide for a taxonomic, or folksonomic, classification of a resource.
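
    The recursive definition of importance is easier to see in code. What follows is a minimal sketch in Python of the usual power-iteration reading of the idea, not a description of any production search engine. The damping value of 0.85, the fixed iteration count, and the four-page toy web are all assumptions chosen for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its current importance, split evenly, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: three pages point at "hub".
toy_web = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a"]}
ranks = page_rank(toy_web)
```

    Nothing in the calculation looks at what a page says or what was asked. The ranking comes entirely from the secondary phenomenon of who links to whom.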

    Jeannette Wing articulates a brilliant set of examples of computational thinking, well worth the read. Of her half dozen everyday examples, my favorite is “Which line do you stand in at the supermarket?; that’s performance modeling for multi-server systems.” Whether it is caching, prefetching, redundant design, etc., Wing masterfully suggests ways of thinking about our world that not only enable us to live better, but provide the foundation for formal algorithm and paradigm development leading to the ability to write “good” programs later in our educational endeavors. If all my graduate students had been children of Jeannette Wing, there is no doubt that teaching client server systems would not be the enormous challenge it is today. With no intent to denigrate the sheer brilliance of Prof. Wing’s insights, I would like to believe that there is some utility in seeing patterns in raw information. It may be that I am simply suggesting that I would like a more inductive approach to seeing patterns in information to complement her approach, which I liken to finding patterns in information as examples of formalisms I already know.

    “Information is not noise, and noise is not information” is a simple, but powerful, tautology. Do we see anything in the pattern of tags that are used to classify an image, or in the annotations associated with comments on a draft document? Recently, I was looking with some of my doctoral students at tags associated via delicious with a set of resources. We have been searching for algorithms that will reliably separate “informative tags” from “noise.” Our goal in this effort is not to develop a description of the resource, but a classification. (Use of annotation to aid in description is, I believe, a somewhat simpler task than finding terms that aid in classification.) Given a wealth of information, we undertook many of the types of computational thinking that Wing, Lu and Fletcher have suggested. We identified tuples, formalized terminology and identified relationships. And in all of this, we made some small progress. One day, feeling good about the progress we were making, we looked at three lists of terms for a given resource. The first seven of 75 rows were as follows:

    Annotation Frequency (AF)    AF*IRF           Inverted Resource Frequency (IRF)
    Storage                      Storage          Filestorage
    Tools                        Onlinestorage    Online-storage
    Backup                       Backup           Onlinestorage
    Web                          Sync             Store
    Online                       Comparison       Filehosting
    Onlinestorage                Sharing          Harddrive
    Sharing                      Lifehacker       Drop.io
    …..                          …..              ….

    Our goal was to model and modify existing techniques used elsewhere to see if we could bubble better classification terms to the upper ranges of a list. It may not be immediately apparent from this particular example, but the terms in the AF*IRF column, on average, tend to be better than the terms from the AF column. However, the reason for showing this example lies not in the final results, but in the intermediate results, and this is a good example of the phenomenon. Remember, our goal is classification terms based on a set of annotations provided by many users. Before reading on, take a look at the IRF column and decide what you see that may be useful……

    Ok, I assume you found one of the two things – the second is not very obvious in this example. Hopefully you see “filestorage” and “onlinestorage”. One of the constraints (a form of computational thinking) is that any terms provided by users are separated into a space separated list. To avoid having the terms file and storage placed as separate terms, users often simply concatenate them or hyphen join them. What appears – we are in early stages of confirming our findings – is that these compound terms appear to take a subcategory-category form. That is to say, if we split the words back up, the second term is the main category and the first term is a subcategory. From a classification point of view, this is a rich data source that would be lost using all of the methods we had been using to identify relevant terms. There is actually much more that I am not presenting here related to how these are found numerically and why it is surprising that we found them here, but if further experimentation supports this and a few other hypotheses we have, it will be an important basis for some dissertation work. The key here is that it was not so much the application of computational thinking that was critical. Surely, without the use of computational thinking, we would not have gotten to this point. At the same time, it was an observation about the data or information that leads to the critical insight that leads to possible new discoveries. I would suggest then that in addition to finding existing patterns in the world, another aspect of computational thinking is the ability to see new patterns in information.
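
    To show what splitting the words back up might look like, here is a small Python sketch. It is not the method we are using in the research; the function name, the tiny vocabulary of single-word tags, and the rule of taking the first split that yields two known words are all assumptions made for illustration.

```python
def split_compound(tag, vocabulary):
    """Return (subcategory, category) if tag looks like a joined pair, else None."""
    tag = tag.lower()
    if "-" in tag:
        # Hyphen-joined terms split at the hyphen.
        first, _, second = tag.partition("-")
        return (first, second) if first and second else None
    # Concatenated terms: try every cut point and keep the first that gives two known words.
    for cut in range(1, len(tag)):
        first, second = tag[:cut], tag[cut:]
        if first in vocabulary and second in vocabulary:
            return (first, second)
    return None

# Hypothetical vocabulary, as might be gathered from single-word tags on the same resources.
vocabulary = {"file", "online", "storage", "hosting", "hard", "drive"}

for tag in ("Filestorage", "Online-storage", "Onlinestorage", "Store", "Drop.io"):
    print(tag, split_compound(tag, vocabulary))
```

    Under the tentative subcategory-category reading, the second element of each pair would be the main category and the first the subcategory. Terms such as “Store” or “Drop.io” pass through unsplit.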

    By the way, you may be curious to know what the second finding was. If you found it, you are to be congratulated. This example is not so clear as others we have found. The second observation is that IRF surfaces things such as “drop.io”, a proper name – the name of a particular online storage system. If we had shown other examples, it would become clearer that while proper names seldom show up in the AF column, they often show up in the IRF column. Thus, we are tentatively observing that under certain conditions, the IRF column may not only show a superordinate and subordinate classification, but exemplars. Most importantly, these annotations are not as immediately apparent in the places we would have expected to find them.

    [/2009/2] permanent link


    Fri, 09 Jan 2009

    Developing Metrics for Annotations (Tagging) (January 9, 2009)

    Introduction and model

    I should begin by explaining that I am going to use new (old) terms here that may be bothersome to some, so let me explain. This posting is about developing new metrics that may be useful in understanding the social networking world where individuals contribute content. This would include tags on flickr photos, or descriptions of delicious bookmarks, etc. The contributors of this information will be referred to as Users and that should not be a problem for most. Further, for consistency with other semantic web efforts the various information objects that are the subject of our efforts are defined as Resources. The information that is associated with a resource is defined most broadly as an Annotation. Some use the term tag, but a tag is only one form of annotation and the study of annotations precedes the web and provides some rich history to guide the investigation.

    In developing metrics for Annotations, it may be useful to look at similar metrics that have been used in other situations. In information retrieval and text mining, term weighting is often used to establish the ranking or relevance of documents. Term weighting is computed from two factors: term frequency and inverse document frequency. Most simply put, term frequency is a normalized value calculated as the number of times a given term occurs in a document divided by the number of words in the document. Thus the term frequency of term_i in document_j, where n_{i,j} is the number of times term_i occurs in document_j, is calculated as:

    \[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \]

    The denominator is the sum of the occurrences of all terms in document_j.

    The inverse document frequency of term_i is the log of the number of documents in the corpus D divided by the number of documents that contain the term:

    \[ \mathrm{idf}_{i} = \log \frac{|D|}{|\{ d \in D : \mathrm{term}_i \in d \}|} \]

    Keep in mind that we are looking at TF-IDF as a model for thinking about new metrics. Thus it is helpful to review the model at the highest level. In the case of documents or document entities, we observed that a document may be envisioned as an entity that is attributed with the counts of the terms that exist in the document. In other words, a document is a tuple whose attributes are the counts of the terms. Over a collection of documents, we can assess the frequency with which a term occurred in a document and the extent to which the term was found in the documents in the collection. The formulas described above were developed to provide a quantification of these values. The resulting product of the two measures provides an indication of the degree to which a given document is “important” related to a given term.
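
    As a minimal sketch of the model just described, the Python below treats each document as a list of terms and computes the two factors and their product. The three toy documents are invented for illustration, and returning zero for a term that appears in no document is my own choice to avoid dividing by zero, not part of the standard definition.

```python
import math
from collections import Counter

def term_frequency(term, document):
    """Occurrences of term in document divided by the total number of terms in it."""
    counts = Counter(document)
    return counts[term] / len(document)

def inverse_document_frequency(term, corpus):
    """Log of (number of documents) over (number of documents containing term)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumption: a term found nowhere gets no weight
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Hypothetical corpus of three very short documents.
corpus = [
    ["online", "storage", "backup", "storage"],
    ["online", "tools"],
    ["online", "sharing", "photos"],
]
print(tf_idf("storage", corpus[0], corpus))  # frequent here, rare elsewhere
print(tf_idf("online", corpus[0], corpus))   # appears in every document
```

    A term that occurs in every document, such as “online” here, ends up with a weight of zero no matter how often it appears, which is the behavior the product of the two measures is meant to deliver.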

    New entities and tuples

    This leads us to ask what new metrics we might use for environments such as social networking sites where we shift from documents as first class entities to Users as first class entities. As is the case for TF-IDF, we will look both at the entities and their attributes and at the collections as a whole. At the core, we know that we will have at least three things that may be related, and there may be more. The three universal objects are:

    • Resources
    • Users
    • Annotations

    These three entities may be primitive or attributed. In the case that the Resource, Annotation, or User entities contain no useful information, we consider them primitive. To the extent that further information may be obtained directly or indirectly, we consider them attributed. In general, one or more of these entities are attributed. As primitives, consider a URL (representing a resource) of the form:

    http://136.142.116.1/access?id=12345

    or a user-id (representing a user) of the form:

    1234567

    or an annotation-id (representing an annotation) of the form:

    1

    Assuming that a DNS lookup on the IP address provides no information, from just this information it would be hard to ascertain any information about these entities. On the other hand, the following entities would appear to provide some attribute information:

    http://www.sis.pitt.edu/index.html
    Michael_B_Spring
    Address

    We may infer with some confidence that the URL is within the edu domain and that the resource is an html page. We could infer, with less certainty, that Michael B. Spring is the name of the user and that the annotation refers to an address. Finally, we might find these entities as tuples of known data such as:

    http://www.sis.pitt.edu/index.html, main page, School of Information Sciences, University of Pittsburgh
    Michael B. Spring, Associate Professor, School of Information Sciences, University of Pittsburgh
    Address, office, vcard format, xxx,xxx,xxx

    In this case, we have a much richer set of attributed objects.

    Some simple metrics with beginning formalisms

    Given primitive or attributed objects, what are the metrics that might be of use? We can begin with the most obvious. Imagine that any given Resource (R) has a series of User (U) provided Annotations (A) associated with it. This might be the case for flickr and is even more so the case for delicious. We begin very simply with a couple of observations.

    • For any given Resource_i, there are one or more individuals who have associated one or more annotations – tags.
    • For any given Annotation_j related to the Resource, the Annotation has been specified by some number of Users.
    • Cross User Annotation Frequency (CUAF) might be a useful metric providing a measure of the percentage of Users who used an Annotation to describe a Resource:
    • Annotation Dominance (AD) might be a useful metric providing a measure of the extent to which a single term was used to describe a Resource:
    • Cross Resource Frequency (CRF) might be a useful metric providing a measure of the extent to which an Annotation is used across Resources:
    • Cross User Frequency (CUF) might be a useful metric providing a measure of the extent to which a given Annotation is used by multiple Users:
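
    One plausible reading of these four metrics can be sketched in Python over a list of (User, Resource, Annotation) triples. The exact ratios below are my assumptions based only on the verbal descriptions in the list; the function names and the sample triples are hypothetical.

```python
def cuaf(triples, annotation, resource):
    """Share of the Users who annotated resource that applied annotation to it."""
    annotators = {u for u, r, a in triples if r == resource}
    appliers = {u for u, r, a in triples if r == resource and a == annotation}
    return len(appliers) / len(annotators)

def annotation_dominance(triples, annotation, resource):
    """Share of all Annotation assignments on resource that are annotation."""
    on_resource = [a for u, r, a in triples if r == resource]
    return on_resource.count(annotation) / len(on_resource)

def crf(triples, annotation):
    """Share of all Resources that carry annotation at least once."""
    resources = {r for u, r, a in triples}
    tagged = {r for u, r, a in triples if a == annotation}
    return len(tagged) / len(resources)

def cuf(triples, annotation):
    """Share of all Users who applied annotation at least once."""
    users = {u for u, r, a in triples}
    appliers = {u for u, r, a in triples if a == annotation}
    return len(appliers) / len(users)

# Hypothetical data: (user, resource, annotation).
triples = [
    ("u1", "r1", "storage"),
    ("u1", "r1", "backup"),
    ("u2", "r1", "storage"),
    ("u3", "r2", "storage"),
    ("u3", "r2", "photos"),
]
print(cuaf(triples, "storage", "r1"), annotation_dominance(triples, "storage", "r1"))
print(crf(triples, "photos"), cuf(triples, "backup"))
```

    Other normalizations are certainly possible, for instance taking logs as IDF does. The sketch is only meant to make the four quantities tangible.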

    On the surface, all of these metrics, which mimic some of the things done by TF or IDF, appear to provide some meaningful kind of measure, but as they are exercised against data, they begin to pale. What we need to do is look more carefully at the simple semantics behind the calculations.

    Digression on Semantics

    After several discussions with PhD Students on the formalisms put forward above and several dozen other variations, we were forced again and again to try to more clearly ask what it was we are trying to uncover or discover. This leads to the following preliminary discussion on the semantics of Annotation analysis.

    Let’s ask simply what term frequency means conceptually; we will try to avoid formulas and speak simply about the semantics. If we have a term_i and a set of documents, it will be true that term_i will occur some number of times (including zero) in each document_j. A term could appear 5 times in a very long document and 5 times in a very short document. We need to account for this as it makes sense that this would make a difference. So, rather than use the count, we can use the percent of the document made up of these terms. In general, we would not expect any term to make up 100% of a document or even a majority of the terms in a document. We might expect the percentages to be very low. Nevertheless, a higher percentage would suggest that a document was more related to the term.

    Now let’s turn to Annotations. In a simple scenario, we might imagine that if 100 Users assigned an Annotation to a Resource_j, and if each User used only one Annotation, and if all of the Annotations were identical, that Annotation_i would likely be a descriptor of Resource_j. (If each of a hundred Users used a different Annotation for Resource_j, we would not be moved to suggest that any particular annotation was a good descriptor.) Thus, Annotation Frequency is not the same as Term Frequency. In the ideal situation, unanimity about a singular tag would likely provide a good description. But things are never this simple. We have at least two dimensions on which Annotations related to a Resource might vary. First, it might be the case that only some Users who Annotate a resource used a particular Annotation. Second, it might be the case that Users provide more than one Annotation for a Resource. The situation is also complicated by the fact that User Annotations are generally weighted flatly. That is, a user may not use a term more than once or assign it an explicit weight. Some have suggested that Annotation order is an implicit indicator of importance, but other research has questioned this assumption. More insidious yet, many systems provide recommended Annotations, based on existing Annotations, which may induce a bias based upon early Annotations. Thus developing a good metric for the importance of a given Annotation is not as simple as Term Frequency. Consider a couple of approaches to measuring Annotation descriptiveness.

    In the first approach, we take the count of a given Annotation and divide it by the count of all of the Annotations that have been applied to the resource. Let’s consider three situations. S1: If there is perfect unanimity using a singular Annotation, the result will be one. S2: If 10% of the users used the Annotation and 90% used some other Annotation, the result would be .1. S3: If 100% of the users used the Annotation, but they also used 9 other Annotations, the result would be .1. This appears to be a little less than we might hope for. At an intuitive level, it would appear that the metric should yield a higher value in S3 than in S2.

    A better approach might be to measure the use of Annotations by each User, and average those uses. Again, we begin by considering three situations. S1: If 100% of the Users used one Annotation, and that Annotation is the target Annotation, each user would have a ratio of 1, and the average would be one. S2: If 10% used the Annotation, their individual measures would be 1; the 90% that used some other tag would have usages of zero. The average would be .1. S3: If 100% of the users used the tag, but they also used 9 other tags, the average would still be .1. This doesn’t seem to get us very much related to S2 and S3. On the other hand, consider S4: where 90% of the users used one tag and 10% used 10 tags. The first approach would yield a value .5 while the second approach would yield a value of .91.
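
    The arithmetic for the first two approaches can be checked with a short Python sketch. The data layout (a dictionary from User to that User's set of tags on one Resource) and the helper names are my own. For S4 the sketch assumes that the Users who applied ten tags included the target tag among them, which reproduces the rounded values quoted above.

```python
def pooled_share(assignments, target):
    """Approach one: target count over all Annotation assignments on the Resource."""
    total = sum(len(tags) for tags in assignments.values())
    hits = sum(1 for tags in assignments.values() if target in tags)
    return hits / total

def mean_user_share(assignments, target):
    """Approach two: average over Users of the target's share of each User's tags."""
    shares = [(1 / len(tags)) if target in tags else 0.0
              for tags in assignments.values()]
    return sum(shares) / len(shares)

def scenario(groups):
    """Build {user: set_of_tags} from (user_count, tags) groups."""
    assignments = {}
    for count, tags in groups:
        for _ in range(count):
            assignments["u%d" % len(assignments)] = set(tags)
    return assignments

others = ["t%d" % i for i in range(1, 10)]           # nine other tags
s1 = scenario([(100, ["x"])])                        # unanimity on one tag
s2 = scenario([(10, ["x"]), (90, ["y"])])            # 10% use the target
s3 = scenario([(100, ["x"] + others)])               # everyone uses it plus nine more
s4 = scenario([(90, ["x"]), (10, ["x"] + others)])   # assumed: ten-tag Users include "x"

for name, s in (("S1", s1), ("S2", s2), ("S3", s3), ("S4", s4)):
    print(name, round(pooled_share(s, "x"), 2), round(mean_user_share(s, "x"), 2))
```

    Only S4 separates the two approaches; S2 and S3 remain indistinguishable under both.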

    One final approach, potentially becoming too complex, would be to find a way to weight the terms so as to differentiate S2 and S3 without having a negative impact on the measures developed for S1 and S4. Consider a system that weights each Annotation based on its penetration. For example, imagine an Annotation-Weight that is equal to the number of times an Annotation is used by the Users who Annotated a Resource divided by the number of Users. We build on approach two. Let’s look at the four situations. S1: If 100% of the Users used one Annotation, and that Annotation is the target Annotation, each user would still have a ratio of 1, and the average would be one. S2: If 10% used the Annotation, their individual measures would now be .1, because the weighted value of the annotation would drop from 1 to .1; the 90% that used some other tag would have usages of zero. The average would now be .01. S3: If 100% of the users used the tag, but they also used 9 other tags, the average would still be .1. This does seem to get us somewhere related to S2 and S3. S4: if 90% of the users used one tag and 10% used 10 tags, the second approach yielded a value of .91. Under this new metric, it would yield a value nearer to .81, which would also seem more reasonable.
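
    A minimal Python sketch of this weighted variant follows, again using a dictionary from User to that User's set of tags on one Resource. The function name and data layout are my own assumptions. S4 is left out because its weighted value depends on how many of the ten-tag Users applied the target tag, which the scenario does not pin down.

```python
def weighted_share(assignments, target):
    """Approach three: each User's share of the target, scaled by the target's penetration."""
    users = len(assignments)
    used = [tags for tags in assignments.values() if target in tags]
    weight = len(used) / users                            # Annotation-Weight (penetration)
    mean_share = sum(1 / len(tags) for tags in used) / users
    return weight * mean_share

others = {"t%d" % i for i in range(1, 10)}                # nine other tags

s1 = {"u%d" % i: {"x"} for i in range(100)}               # unanimity on one tag
s2 = {"u%d" % i: ({"x"} if i < 10 else {"y"}) for i in range(100)}  # 10% use the target
s3 = {"u%d" % i: {"x"} | others for i in range(100)}      # everyone uses it plus nine more

for name, s in (("S1", s1), ("S2", s2), ("S3", s3)):
    print(name, round(weighted_share(s, "x"), 2))
```

    The weight leaves S1 and S3 where they were and pushes S2 down by a factor of ten, which is the separation the first two approaches could not provide.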

    Are there other scenarios we might imagine that distinguish good Annotations from bad ones as descriptions of Resources? Using some of these simple scenarios, we might be able to cobble together a semantic conceptualization of what is good and what is bad. Above, I have suggested some preliminary steps in this analysis which seem to provide some differentiation. This needs to be taken several steps further. If we can accomplish this goal for “Annotation Frequency”, we can turn to the thornier question of whether there is a companion measure (equivalent to IDF) that provides additional refinement of the value of an Annotation in distinguishing relevant resources from non-relevant resources.

    Closing/Opening Comments

    In closing this preliminary discussion, let me list some of the issues that any metrics will eventually need to take into account:

    • As we have already stated, the perfect scenario is one in which many people use an identical single tag to describe a resource.
    • People tend to use functional as well as descriptive tags – such as “*” or “read” that might be identified as stop words for our purposes here.
    • While TF-IDF imagines infrequently occurring terms are good, for tagging purposes, tags associated only with a single user might be imagined to be idiosyncratic rather than descriptive and might be eliminated.
    • There will likely be an interplay between general and specific concepts in tagging that might make it useful to treat broader and narrower terms as somehow related; for example, 50% of taggers might use “Java” as a tag while 50% use “Programming Language”.

    There are other considerations I am sure, but this piece has already extended well beyond what I had planned. It is clear from this exploration, at least to me, that there is some room to explore possible metrics which might be experimentally verified or debunked.

    [/2009/1] permanent link



     
     
