DMR Upgrade Project 2013
From Ball State University Libraries Wiki
Revision as of 13:07, 3 April 2013
The DMR Upgrade Project 2013 is an initiative by LITS to upgrade the Digital Media Repository's underlying CONTENTdm system from Server Version 184.108.40.206 to 6.3 and from Website Version 4/5/? to version 6.3.
Because carrying forward many major customizations takes considerable work (even though many of them have since been folded into the official release), we need to show restraint during this and future upgrades to avoid "over-customizing": implementing customizations that will impede or even prevent future upgrades.
In the newest version of CONTENTdm, a set of features has been provided to allow a large number of customizations to be implemented and easily carried over into future versions, with minimal work required to re-implement them. We will be taking full advantage of these features during this upgrade project.
The ETA for this project will be the week of May 6 to May 10, 2013 (the interim period between Spring and Summer Semesters). The current implementation plan is to set up a fully functional duplicate LIBX server and simply "swap it out" with the production LIBX server.
This upgrade process will require several essential features to be re-implemented in a sustainable way (one that will provide the least amount of resistance during future upgrades) and, of course, may affect other systems that rely on the Digital Media Repository (via API, Harvest Points, or otherwise). This article will document all of these features and systems, each of which will be annotated with technical/developer notes regarding their implementation, for future reference.
Nomenclature and Commenting
There are currently three servers running CONTENTdm in the library (soon to be four). To prevent confusion, these will be referred to in this document as follows:
|Name||Hostname||URL||Description|
|DMR Server||LIBX||http://libx.bsu.edu/||Our live DMR server|
|BoT Server||LIBCDM2||http://libcdm2.bsu.edu/||Board of Trustees Minutes Repository using newest version of CONTENTdm.|
|Test Server||LIBCDMTEST||http://libcdmtest.dhcp.bsu.edu/||Test server for the newest version of CONTENTdm. (Available on-campus only.)|
A listing of staff who will be directly involved with the upgrade process in some way.
Class A (Required for Upgrade)
User based access control for specific collections.
Currently on the DMR Server this feature is used in three specific ways: The Architecture Images Collection, BSUBrowse and BSUSearch.
- The Architecture Images Collection
- When a user attempts to browse or search this collection, they are prompted with an End User Copyright Agreement and login form. The login form authenticates the user based on AD credentials and, once authenticated, allows the user to browse and/or search the collection.
- There are some minor security issues with this method. In most circumstances the EUCA page successfully prevents users from seeing any metadata (including thumbnails) associated with the collection. However, if a user performs a repository-level search (searching across 2 or more collections), results from this collection can appear. Users must still enter their credentials to view any specific item returned this way, but the thumbnails of these copyrighted images, as well as some of their associated metadata, do appear in the results list. Example: results for searching two collections for the word "a" in any field.
- BSUBrowse and BSUSearch
- By default, unpublished collections are not accessible in any way to anonymous users. In order to browse or search these collections, users must first authenticate with their BSU credentials via the two links at the bottom of the DMR homepage labeled BSUBrowse and BSUSearch. Both links authenticate through the server (there is no login form), but redirect the user to either the browse or search pages. Once authenticated, there is custom code that allows the user to see unpublished collections.
While the existing solutions work fine, the code behind them is messy and the login methods are not consistent (or necessarily secure). We have found a new method that provides a single login form, is built on top of the existing CONTENTdm authentication code, and can easily and more reliably replace the two methods listed above.
The out-of-the-box login feature of CONTENTdm 6.3 was designed specifically for administrative users and users who have permission to submit content. It was not designed as a way for end-users to log into the system or to protect anything except unpublished collections. This default functionality would work well for our internal users (such as MADI staff).
The proposed solution (which has been implemented on the Test Server) is to modify the default login page to accept either CONTENTdm credentials (and thus, those domain users who have been specifically added to CONTENTdm as administrative users) or BSU credentials via LDAP authentication.
The solution first checks to see if the user is registered with CONTENTdm and if so proceeds as normal. If not, it then checks to see if the user's credentials work via LDAP and if so, sets the user at a separate level of authentication that allows them to see certain, hard-coded, collections that would normally be hidden, but completely dissociates the user from any administrative privileges.
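The two-step check described above can be sketched as follows. This is only an illustration of the control flow, written in Python rather than the PHP used by LoginController.php; the names `AuthLevel`, `authenticate`, `cdm_check`, and `ldap_check` are hypothetical and do not appear in CONTENTdm. The two checker functions are passed in so that the sketch does not assume anything about how CONTENTdm or the LDAP server is actually queried.

```python
from enum import Enum
from typing import Callable


class AuthLevel(Enum):
    """Hypothetical authentication levels used only in this sketch."""
    ANONYMOUS = "anonymous"
    LDAP_USER = "ldap_user"   # may see protected collections, no admin rights
    CDM_USER = "cdm_user"     # standard CONTENTdm account, handled as normal


def authenticate(
    username: str,
    password: str,
    cdm_check: Callable[[str, str], bool],
    ldap_check: Callable[[str, str], bool],
) -> AuthLevel:
    """Try CONTENTdm credentials first, then fall back to LDAP."""
    if not username or not password:
        # Reject empty input before any lookup. Some LDAP servers treat a
        # bind with an empty password as an anonymous bind that succeeds.
        return AuthLevel.ANONYMOUS
    if cdm_check(username, password):
        return AuthLevel.CDM_USER
    if ldap_check(username, password):
        # Deliberately a separate level: no administrative privileges.
        return AuthLevel.LDAP_USER
    return AuthLevel.ANONYMOUS
```

Keeping the LDAP result as its own level, rather than reusing the CONTENTdm one, is what keeps BSU users dissociated from administrative privileges.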
The only minor drawback to this solution is that it completely hides protected collections from anonymous users, meaning they won't know of the existence of these collections unless they log in. This problem can be easily solved, however, by providing helpful tips on the customized home page (and other pages next to the "log in" button) that would inform the user of the benefits of logging in. We could also highlight these protected collections on the homepage with a clear message stating that users must log in before they can view and/or search them.
For unpublished collections, where BSUBrowse and BSUSearch would come into play, because the group of users who need these permissions is small, we can use the out-of-the-box login feature of CONTENTdm to handle these on a case-by-case basis. If a user absolutely needs to be able to see unpublished collections, but must not be an administrative user, we could simply create a dummy collection, add them as a CONTENTdm user, and assign them minor rights to that dummy collection. This would enable these users to see unpublished collections, but won't give them any administrative permissions, without changing any code.
In order for PHP LDAP functions to work, the appropriate php_ldap.dll file from the same version of PHP CONTENTdm runs on must be used. CONTENTdm is using PHP 5.3.3, which uses the php_ldap.dll from the PHP 5.3.4 VC6 release which can be found at the PHP for Windows website. The php.ini file must be edited to include it as usual. The relevant location is OCLC\CONTENTdm\Content6\common\php533\ext\.
The relevant code for all authentication is in OCLC\CONTENTdm\Content6\website\cdm_common\cdm\controllers\LoginController.php around line 110, the authAction() function, and OCLC\CONTENTdm\Content6\website\cdm_common\cdm\controllers\LogoutController.php around line 106, the end of the indexAction() function.
The code that handles whether or not the "log out" link appears on the page is located in OCLC\CONTENTdm\Content6\website\cdm_common\cdm\layouts\scripts\default.phtml around line 286 at the "if(@$_SESSION['authenticated'])" condition.
The code that controls which collections appear (and thus are available to the user to browse and/or search) is in OCLC\CONTENTdm\Content6\server\docs\dmscripts\DMSystem.php around line 59 (the dmGetCollectionList() function). Unfortunately this file doesn't get any $_SESSION data sent to it. Instead it gets called through a series of other functions that eventually end up using the API code located in OCLC\CONTENTdm\Content6\website\cdm_common\cdm\models\CdmApi.php around line 698 through the api_get_collection_list() function. This function does have access to the session and calls the dmGetCollectionList() function, so we can modify the result it receives to filter out results that shouldn't appear unless a user is logged in as an LDAP user.
To keep consistent with OCLC's file structure, the file containing the list of collections that are only viewable by LDAP users is located at OCLC\CONTENTdm\Content6\server\conf\bsupriv.php.
In summary: to allow BSU users to log in to the system and view collections that are only accessible to BSU users, and to provide this feature through the existing log in tool, the bsupriv.php file must be created and updated with the aliases of those collections, the LoginController.php and LogoutController.php files need to be updated to handle LDAP authentication if standard authentication fails, the default.phtml file needs to be updated to show the "Log out" text even when a user is only authenticated via LDAP, and the CdmApi.php file needs to include the bsupriv.php file and filter out collections from the general public.
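The filtering step in CdmApi.php can be illustrated with a short sketch. It is written in Python for brevity, not in the PHP of the actual file, and it rests on several assumptions: the session key `ldap_authenticated`, the dictionary shape of each collection, and the alias `ArchImg` in the usage note are all hypothetical, and the leading slash on aliases is assumed from the CISOROOT-style URLs used elsewhere in this article.

```python
def filter_collections(collections, private_aliases, session):
    """Return the collections a visitor is allowed to see.

    collections     -- list of dicts, each with an "alias" key (assumed shape)
    private_aliases -- aliases listed in the private-collection file
    session         -- dict standing in for PHP's $_SESSION
    """
    # Normalize aliases so "/MusInst" and "MusInst" compare equal.
    private = {alias.strip("/") for alias in private_aliases}

    if session.get("ldap_authenticated"):
        # Logged-in BSU users see everything the API returned.
        return list(collections)

    # Everyone else gets the list with protected collections removed.
    return [c for c in collections if c["alias"].strip("/") not in private]
```

For example, given a list containing `/MusInst` and a protected `/ArchImg`, an empty session would receive only `/MusInst`, while a session with the LDAP flag set would receive both.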
Protected High Resolution Images
Special server-based access restrictions on a collection of images.
Currently on the DMR Server this feature is only used for the Art History Images collection. When a user attempts to search or browse this collection, they are prompted to agree to an End User Copyright Agreement that does not require them to log in. The only purpose of this EUCA is to make the user agree to the terms presented. This information could just as easily be presented at the page level, especially considering the user doesn't have to authenticate.
Items in this collection have a piece of metadata called "Link to Larger Image". This link points to a separate web server that resides on LIBX. The sole purpose of this web server is to authenticate users, determine what authentication group they belong to, and then, if they are in the appropriate group, serve up the appropriate large-scale image.
There is nothing preventing an anonymous user from accessing the collection, regardless of whether or not they agree to the EUCA.
Because the current solution is based solely on server-side components, no code changes are necessary. Instead, a second web server must be set up (whether or not it resides on LIBX is another issue) that prompts users for their BSU credentials and then serves up images based on the authenticated user's permissions.
The only other minor change would be to either modify the code so that the EUCA information appears on each item's page, provide the EUCA on the second web server (so that it won't appear until a user tries to view a larger image), or modify the copyright metadata to reflect it (or simply leave the metadata as it is).
Backwards-compatible linking so old links (from 3rd party sites) still function correctly.
All standard links to searches, collections, etc., have been designed to be backwards compatible in the latest version of CONTENTdm. Once the upgrade is complete, if users click an old link to LIBX, it will still work in most circumstances.
However, because of the custom front-end that emulates "sub-collections" (canned searches based on a few particular pieces of metadata), "sub-collection" links will be broken after the upgrade. Here is an example of a collection link that will work after the upgrade. Here is an example of a "sub-collection" link that will not work. As can be seen in the same link modified for the Test Server, the backwards-compatible auto-redirect mechanism still detects the "sub-collection" as a type of landing page, but because, in CONTENTdm, there is no such construct as a sub-collection, it produces an error.
There are a few options depending on how much code we want to change.
The easiest option would be to alter the text of the standard error page to be more helpful, perhaps even providing information as to why the link is broken, along with links to the collection itself, and information on how the user can change their links to work in the future.
Another option would be to alter the code used by the "landingpage" view to detect aliases that look like sub collections, and automatically forward the user to the main landing page of that collection (since, as mentioned below, "sub-collection" links will be listed on the main landing page for their collection). Of course, this option has a few drawbacks: users won't be actively encouraged to fix their broken links; there will be some significant code alteration involved; this customization will need to be re-implemented with every upgrade.
Sub-collections and Custom Search Boxes
While sub-collections are not a truly self-contained entity in CONTENTdm, and CONTENTdm does not recognize or support the notion of sub-collections, there is definitely a recognized need for a method of providing users with the ability to both browse and search sub sets of various collections.
Sub-collections have been added to the DMR through a major overhaul to the front-end website created by Budi Wibowo. The custom front-end can basically be considered a content management system layer resting on top of the actual CONTENTdm installation. In order to take advantage of the many new features provided by the out-of-the-box front-end, this custom front-end will not be directly carried forward with the new version, but its features and functionality will.
Links to sub-collection landing pages are discussed in the Linking section above.
An example of a collection with sub-collections is the U.S. Civil War Resources for East Central Indiana collection. It includes four sub-collections with their own descriptions and search boxes. However, because sub-collections are not an actual feature of CONTENTdm, they have been implemented by either simply directly linking to the items within the sub-collection and/or providing search and browse functionality across a subset of items in the collection that have been assigned specific pieces of metadata.
For example, the U. S. Civil War Resources from the United States Vice Presidential Museum at the Dan Quayle Center sub-collection isn't really a sub-collection per se, but rather a filtered search within the parent collection that only looks for items containing the string "Dan Quayle Center. United States Vice Presidential Museum" in their metadata.
This collection has been further broken down into "sub-sub-collections", such as the John Cabell Breckinridge Collection, which is really just another filtered search within the parent collection that only looks for items containing the string "John Cabell Breckinridge Collection" in their metadata.
The search boxes in these sub-collections have been pre-formatted to only search the filtered results associated with their sub-collection.
Due to problems with linking (mentioned above), sustainability in future versions of CONTENTdm, and the fact that sub-collections are not an actual supported type of collection in CONTENTdm, sub-collections will be implemented in a drastically different way in the new DMR.
While we can continue to refer to them publicly as "sub-collections" so as to not confuse end-users, developers and content creators should understand that, moving forward, sub-collections are simply canned searches on specific sub sets of items with certain metadata values.
Rather than redeveloping the "content management system" from the current system, we will be providing links, descriptions, and search boxes for all sub collections on their parent collection's landing page. An example can be seen in the U.S. Civil War Resources for East Central Indiana Collection on the Test Server. While not complete, this landing page should serve as an example for all future collections that contain sub-collections. Each sub-collection is given a link that, once clicked, will uncover details, links, and search boxes for the sub-collection and any sub-collections that reside within it.
This change does not require any changes to the underlying code of CONTENTdm, but rather a shift in the paradigm of how we think of "sub-collections". Most, if not all, of the canned searches that already exist in the DMR Server can be directly copied and pasted into landing pages, and the search boxes are very simple to construct and copy.
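Since a sub-collection is just a canned search, the links placed on a landing page could be generated rather than typed by hand. The sketch below is an assumption-heavy illustration: the URL pattern (`/cdm/search/collection/.../searchterm/.../field/.../mode/exact/conn/and/order/nosort`) is our reading of how the new version formats search URLs and should be checked against a real search on the Test Server, and the alias `CivilWar` in the usage note is a placeholder, not the collection's real alias.

```python
from urllib.parse import quote


def canned_search_url(base_url, alias, phrase, field="all"):
    """Build a link that searches one collection for an exact phrase.

    The path layout is an assumed pattern, not taken from OCLC documentation.
    """
    term = quote(phrase, safe="")          # encode spaces, slashes, etc.
    safe_alias = quote(alias.strip("/"), safe="")
    return (
        f"{base_url.rstrip('/')}/cdm/search/collection/{safe_alias}"
        f"/searchterm/{term}/field/{field}/mode/exact/conn/and/order/nosort"
    )
```

Calling `canned_search_url("http://libcdmtest.dhcp.bsu.edu/", "CivilWar", "John Cabell Breckinridge Collection")` would produce a link whose search term is the percent-encoded phrase, which is the metadata string that defines that "sub-sub-collection".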
Providing in-line audio/video resources via MediaSite.
Several collections have objects which contain audio and/or video files. In the past, CONTENTdm's out-of-the-box features that provide in-line media players for these objects were not used. Instead, a decision was made to place all multimedia in the university's MediaSite system, a central storage server that provides various streaming features.
Items in the DMR are associated with items in MediaSite through an internal metadata field called "Media Identifier", which contains the URL for the item's particular video/audio stream.
MediaSite does not provide direct access to streaming content, so the current solution loads a web page that contains the appropriate stream and positions that page within an HTML iframe so that only the stream itself (and the controls) appear. This iframe is then placed at the top of the item's page so that it appears as though the media is being streamed in-line on the web page.
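As a rough illustration of the iframe approach, the following sketch turns a Media Identifier value into iframe markup. It is written in Python only to show the logic; the real customization lives in the PHP view templates. The function name `media_iframe`, the default dimensions, and the example host in the usage note are assumptions, and the sketch does not attempt the positioning trick that hides everything but the stream.

```python
from html import escape
from urllib.parse import urlparse


def media_iframe(media_identifier, width=640, height=400):
    """Return iframe markup for a Media Identifier URL, or '' if unusable."""
    url = (media_identifier or "").strip()
    parsed = urlparse(url)
    # Only embed plain web URLs; skip empty fields and odd schemes.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return ""
    # Escape the URL so characters such as & and " are safe in an attribute.
    return (
        f'<iframe src="{escape(url, quote=True)}" '
        f'width="{width}" height="{height}" '
        'frameborder="0" scrolling="no"></iframe>'
    )
```

An item whose Media Identifier is empty (or contains something other than a web address) simply gets no player, which mirrors the field's role as both the indicator and the source for in-line media.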
Not all collections use this method. For example, the Indiana ArtsDesk Radio Archive Collection contains links to WMA files hosted on the DMR server. The Dolls Collection contains 3D QuickTime MOV files that are hosted on the DMR server. The Ball State University Student Art Collection contains direct links to MOV files that are hosted on the DMR server. And the Musical Instruments Collection contains compound objects with WMA files that are hosted on the DMR server. Recently, compound objects in some collections have been given the in-line media treatment as well.
In testing, the newest out-of-the-box in-line media player in CONTENTdm simply doesn't meet our needs. A MediaSite stream can't be associated with it, and it is inconsistent in its ability to play back video and audio files. It also mistakenly recognizes 3D Object MOV files as video files and attempts to play them, which doesn't work.
The original customization will be carried forward in almost exactly the same form (with some minor improvements). The Media Identifier metadata field will continue to be both an indicator and source for the in-line media that appears on an object's view.
Because there are still some older collections (such as those linked above) that do not use MediaSite, the current plan is for LITS to provide these media files (along with identifying information) to MADI. To be consistent with our current mandate to place all audio/video in MediaSite, MADI will then place these items in MediaSite and add the appropriate Media Identifier to each object.
A special consideration must be made for 3D Object MOV files, as the new version of CONTENTdm recognizes them incorrectly as videos. Either code will need to be changed so that it recognizes them correctly, or these items will need to be added as URLs and be hosted elsewhere.
The relevant code for adding in-line media to single items is in OCLC\CONTENTdm\Content6\website\cdm_common\cdm\views\scripts\cdm\singleitem.phtml around line 232. There is also a minor change around line 429 that will hide the content_main DIV if an item contains a residual link to a stream (or an outdated URL). This customization has been commented out, as it may not be needed in the long run.
The relevant code for adding in-line media to compound objects has not yet been identified, but the code should be similar.
Using Google Analytics to more easily track usage and get page view statistics for each collection.
As "messy" as the code is, it gets the job done and Google Analytics is able to track just about every page view from the DMR. However, Google Analytics itself, while being a very useful and powerful tool, doesn't provide an easy way for us to differentiate hits between different collections. To do that, Alex Lemann created a Python script, currently housed on one of LITS' Linux servers, that grabs Analytics data from a given date range and produces distilled numbers for each individual collection (based on a list of collection aliases). This script must be manually run on a monthly basis.
While LITS has obtained administrative access to the DMR Analytics account, it is still listed under Budi Wibowo's personal account. This isn't a problem, but it makes for inconsistencies and LITS would prefer it if personal accounts were not used to set up these accounts in the future (instead, a general LITS account should be used, and personal access can be set up later).
On that note, it also appears that the Python script is connecting to Analytics with the litstablet at gmail account, while Alex also had access to it through an ablemannbsu at gmail account. In the future, all Google Analytics usage for the DMR will be housed under one general LITS account.
During the upgrade process we will be able to streamline and possibly improve our usage of Google Analytics, as well as make it a sustainable customization that will be easily carried forward with future versions of CONTENTdm.
One problem with carrying this change forward is that URLs in the new version look different. For example, the Musical Instruments collection on the DMR Server uses the following URL: http://libx.bsu.edu/cdm4/collection.php?CISOROOT=/MusInst, while the Test Server uses: http://libcdmtest.dhcp.bsu.edu/cdm/landingpage/collection/MusInst. The actual page used (collection.php) is hidden and parameters now use a new "clean" URL. This won't negatively affect statistics, but it will mean that historic data will not directly match new data. It also means that the Python script we are currently using to glean collection-by-collection stats may be affected.
Because of this, the proposed solution for Google Analytics is to:
- Create a new Analytics account for the new version of the DMR under the firstname.lastname@example.org account (under which several other Analytics accounts are housed).
- Modify the monthly Python script to work with the new URLs if necessary OR totally replace it with a web-based script that can be run from anywhere (this would be preferable, as then ASC could run the script for any date range without going through the extra step of requesting it from LITS).
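If the existing script is modified rather than replaced, the main change is recognizing the collection alias in both URL styles. The sketch below is not the existing script (whose internals are not documented here); it is a minimal illustration using only the two example URLs given above, and the function names and the `(url, views)` row format are assumptions.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def collection_alias(url):
    """Extract the collection alias from an old- or new-style DMR URL."""
    parsed = urlparse(url)

    # Old style: /cdm4/collection.php?CISOROOT=/MusInst
    root = parse_qs(parsed.query).get("CISOROOT")
    if root:
        return root[0].strip("/") or None

    # New style: /cdm/landingpage/collection/MusInst
    parts = [p for p in parsed.path.split("/") if p]
    if "collection" in parts:
        idx = parts.index("collection")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


def views_by_collection(rows):
    """Sum page views per alias from (url, views) pairs."""
    totals = Counter()
    for url, views in rows:
        alias = collection_alias(url)
        if alias:
            totals[alias] += views
    return totals
```

Grouping by alias rather than by raw URL would also let historic and new data be reported side by side, even though the underlying URLs no longer match.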
There are several applications that use the search API provided by CONTENTdm. These applications have been developed both internally and by other parties not directly associated with the University Libraries. They will need to be tested and updated as necessary. Below is a list of known applications that use the API and any relevant details and contact info.
- BSU Maps Surface App
- Contact: John Godsey, LITS Student
- BSU Photos Surface App
- Contact: John Godsey, LITS Student
- Museum of Art Surface App
- Contact: John Fillwalk (John Straw and Jim Bradley)
- Primarily uses the David Owsley Museum of Art Collection. This collection has been moved to the Test Server so that the developers of this application can test it and update if needed.
- SecondLife and Blue Mars projects
- Contact: Jim Connolly and John Fillwalk (John Straw and Jim Bradley)
- What Middletown Read?
- Contact: Jim Connolly (John Straw)
In an email from Jim Connolly (which was also copied to John Fillwalk) sent on 3/24/2013, he said "I don't forsee any problems with the Virtual Middletown project, though I am checking with John Fillwalk to be sure. John Straw is in a better position to answer re: the Second Life work we did. I think it does use some DMR assets. As for the WMR material, just let me know if it has an effect."
In an email from John Fillwalk (which was copied to Ina-Marie Henning) sent on 3/22/2013, he said "We will look into this and get back to you thanks for letting us know."
It sounds like the WMR project won't be impacted, but it still needs to be checked for any references to LIBX. We're still waiting to hear back from John Fillwalk about the others.
General Usability Concerns
- Repository- and Collection-level Searching
- There is a built-in difference between performing a search on multiple (or all) collections vs. searching a single collection. Once a user gets to the single-collection level, advanced search features automatically adjust to search only the defined metadata field associated with that collection.
- While this shouldn't be a problem, it was made very obvious and caused some user confusion on the BoT Server (as users who thought they were searching the BoT collection were in fact searching "all" collections, which the system treats as more than one, even if only one collection is published).
- There are several factors that work in our favor that will make this search discrepancy less of an issue. The homepage customization (using the design from Outside Source) will give us complete control over the first search method users encounter. The current DMR Server also has these automated features already.
- Experienced and savvy users who are used to our advanced search page will already be more than accustomed to the differences between the two levels of searching, while average users who aren’t necessarily interested in all of the advanced search features provided to them won’t be confused or overwhelmed by the simple one-box interface.
- Date-based Searching and Metadata Considerations
- Another usability issue, which was brought to our attention by a user trying to search on the BoT Server, is related to how we set up metadata fields in each collection. It is also related to the repository- and collection-level searching issue mentioned above.
- When a user performs a date search (from any level), CONTENTdm searches the selected collections based on any metadata fields that have been assigned (by us) to those collections as well as mapped to one of the dc.date fields. This has proven to be problematic for users when the "Date-Available" or "Digital Date" field has been mapped to any dc.date field. Most users do not assume that a general date search will return results containing dates related to when an item was entered into the CONTENTdm system. Instead, they are expecting dates related to the original context of the item.
- In order to correct this issue, while still retaining the ability for advanced users to search across the "Digital Date" field if needed, we must unmap the "Digital Date" field from any Dublin Core date field. This, of course, could prevent this date from showing up in things such as APIs or Harvest Points, but it would always be available in the DMR as a searchable field.
- The other issue is that, when looking at the defined metadata fields for each collection in the DMR, date fields have been set up as "Text" fields, rather than "Date" fields. While the impact of this is somewhat minor, setting up date fields as "Date" types vastly improves the ability of the system to sort and search those fields. Changing this setting does not appear to adversely affect the collection, and would only require someone to go through each collection and change the type of each date field, then re-index the collection.
- In-Document Searching, Full-Text, and the "Text" tab
- As a result of the Summon/OneSearch project, the current plan is to go through all of our collections and unhide any Full-Text fields as well as possibly map them to some standard Dublin Core field (so that they appear in Harvest Points).
- Unhiding these fields before the upgrade will result in very ugly pages, among other issues, in the current DMR Server.
- Many PDF/compound objects have both an Image or PDF tab above the actual iframe that shows the item, and next to that, a Text tab. The purpose of the Text tab is to show whatever is in the full-text metadata field for that object. By default, however, if the full-text metadata field is hidden (via the administrative interface) the Text tab still shows up, but doesn't display anything.
- By default there is also a standard "Text Search" feature that appears next to the Text tab and allows users to search the full-text of the item they're looking at. This only appears when full-text metadata is not hidden for the collection.
- Full-text metadata also appears under the "Description" area near the bottom of the page if full-text is set as not hidden. This looks messy with items that have a lot of text.
- We would like to still provide the "Text Search" feature, but we don't want the Text tab to appear in any circumstance (as it simply contains jumbled and strangely formatted plain text that may confuse users), and we would like to prevent the full-text from appearing in the "Description" area of the page.
- Preventing it from appearing in the "Description" area is as simple as enabling the "Hide full text" setting in the Website Configuration Tool.
- In summary, to allow in-document searching we must set full-text metadata fields in the administrative interface to not hidden. To hide the newly unhidden full-text metadata field from showing up near the bottom of the page we must change the full-text setting in the Website Configuration Tool. To get rid of the Text tab, we must make some small modifications to the code, detailed below.
Relevant code for hiding the "Text" tab can be found in OCLC\CONTENTdm\Content6\website\cdm_common\cdm\views\scripts\cdm\compoundobject.phtml around line 378. Commenting the li item out works, and as long as full-text is not hidden, the "Text Search" feature will still appear. Another minor modification to the file around line 537 must be made to comment out "tab window 2" to prevent the contents of the now-hidden tab from appearing.
This tab also appears in "single item view" and will need to be edited out accordingly. Relevant code can be found in OCLC\CONTENTdm\Content6\website\cdm_common\cdm\views\scripts\cdm\singleitem.phtml around line 243, with the corresponding tab content div around line 372.
Browse by Subject, Location, Format, and Contributors
Several navigation links that, when clicked, present the user with a list of collections organized by those fields.
Class B (Important but will not prevent upgrade)
Outside Source Front-End Design
Spotlight Collections on Homepage
Custom Header for David Owsley Museum of Art Collection
The DMR Server has a small customization that displays a different banner when the user is browsing or searching the David Owsley Museum of Art Collection.
The Museum of Art would like to see this customization carried forward. We may modify it a bit to make it fit better with the new design of the web site, however. Jim Bradley has informed them that it may disappear and come back in a different form until we speak with them about some other graphical modifications they were interested in.
Class C (Will NOT be implemented)
- Hide Header