The start of the new academic year has seen students and staff frustrated by recurring disruptions to the Lancaster University Virtual Learning Environment (LUVLE).
The Learning Technology Group (LTG), part of the University’s Corporate Information Services (CIS), have been working to minimise disruption and identify the root cause over the last two weeks, and a potential solution has now been implemented with positive results.
Problems with LUVLE began on October 6, with 75 server crashes occurring on the system in the period up to October 20. These crashes were causing the system to be unavailable for periods usually between five and 20 minutes long, meaning students were unable to access important information during the first week of their courses. Overloading of the system, understandable at this point in the year, has been linked to these issues, although this has not been the ultimate cause.
The introduction of new servers, intended to spread the workload across LUVLE, has since reduced the amount of disruption. “[T]he number of crashes which have occurred each day has been lower since the new server [has been] up and running”, reported Head of CIS Andrew Meikle, who is overseeing the issue.
Whilst working to stem the flow of crashes, technicians from LTG and Information Systems Services (ISS) have been investigating the underlying cause of the problem, which has proved difficult to pin down. Crashes have been caused by “memory leaks”, which occur when server memory is not released for re-use. “Ultimately, when the server runs out of memory altogether, it crashes,” said Meikle.
Using a new monitoring mechanism, the LTG have been able to track memory usage and foresee crashes, which can then be staved off by restarting components of the LUVLE servers.
This short-term measure still had an impact upon LUVLE availability, said Meikle, but this impact “can be measured in a matter of a few seconds, as opposed to a period of recovering from a crash which can be anything from five to twenty minutes.”
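The article does not describe how the LTG's monitoring mechanism works. As a rough illustration of the general idea only, the Python sketch below assumes memory use is sampled periodically, fits a straight line to the samples, and flags a restart when the projected time to exhaustion falls within a chosen horizon. The function names, the memory limit and the ten-minute horizon are all hypothetical and are not taken from LUVLE.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (timestamp in seconds, memory used in MB).
Sample = Tuple[float, float]


def leak_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory use over time, in MB per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def seconds_until_exhaustion(
    samples: Sequence[Sample], limit_mb: float
) -> Optional[float]:
    """Projected seconds until memory reaches limit_mb, or None if not growing."""
    rate = leak_rate(samples)
    if rate <= 0:
        return None
    _, latest_mb = samples[-1]
    return max(0.0, (limit_mb - latest_mb) / rate)


def should_restart(
    samples: Sequence[Sample], limit_mb: float, horizon_s: float = 600.0
) -> bool:
    """True if exhaustion is projected within horizon_s seconds (assumed value)."""
    remaining = seconds_until_exhaustion(samples, limit_mb)
    return remaining is not None and remaining <= horizon_s


if __name__ == "__main__":
    # Illustrative readings only: memory growing by 60 MB per minute.
    readings = [(0.0, 1000.0), (60.0, 1060.0), (120.0, 1120.0)]
    print(should_restart(readings, limit_mb=1500.0))
```

A planned restart triggered this way costs a few seconds of availability, as Meikle describes, rather than the five to twenty minutes needed to recover from an unplanned crash.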
“The memory leak in LUVLE causes a crash; we are looking for the root cause of the memory leak, and believe that we have [found this],” Meikle asserted. However, it is not in the LTG’s power to stop these memory leaks causing crashes. Only IBM, the business solutions company who provide the software which LUVLE is built upon, have this ability.
LTG have been working closely with IBM, who provided a “hotfix” on October 22. According to the LTG blog, IBM believe that this “will reduce the impact of the memory leak and thus stop the service outages”. The “hotfix” is now being tested and gradually implemented.
Concern has grown over the effects of LUVLE outages upon students, particularly in departments which rely more heavily on regular interaction via LUVLE, such as the Department of English and Creative Writing. No members of staff from the department were available for comment, but one second-year English and Creative Writing student, who chose not to be named, expressed the importance of LUVLE to her course.
“[Creative Writing] students depend on it highly, due to the fact that it’s the only way we can share our work and have our classmates read and critique it before a workshop. It also serves as an archive you can run back to when you’re putting together your portfolio, or need to re-read your classmates’ work in order to write critiques for them,” she said.
For these reasons, the University has now acknowledged that many students have been inconvenienced by this issue. Gavin Brown, Director of Undergraduate Studies, and Acting Chair of the Information Technology Policy Committee, has noted that “no student would be disadvantaged by late submission of assessed work in week one”.
As the issue has developed over the past two weeks, the LTG have been keen to communicate progress to LUVLE users. “We are providing regular updates to all users via Message of the Day and a linked web page, and writing to Heads of Department and Departmental Administrators regularly to keep them updated”, stressed John Gallagher, Director of ISS.
Updates have taken the form of a blog available from the ResNet homepage, which carries the following apology: “The Learning Technology Group apologises unreservedly for this inconvenience and has been working hard to resolve the underlying cause(s) and mitigate the effects on LUVLE users.”
Meikle has also been keen to stress that “we are [investigating] in such a way as to minimise the on-going impact of the server instability”. One way of doing this has been artificially generating load on test servers, which has allowed solutions to be tested without further affecting the actual LUVLE servers.
The LTG continue to work closely with IBM to find and resolve the root cause of the issue. At the time of writing, LUVLE availability has been restored to between 99.3% and 99.9%.
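To put those availability figures in context, the short Python sketch below converts an availability percentage into the downtime it implies. The one-week period is an assumption chosen for illustration; the article does not say over what period availability was measured.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def downtime_minutes(availability_percent: float,
                     period_minutes: float = MINUTES_PER_WEEK) -> float:
    """Minutes of downtime implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    for pct in (99.3, 99.9):
        print(f"{pct}% available: {downtime_minutes(pct):.1f} min down per week")
```

On that assumption, 99.3% availability corresponds to roughly 70 minutes of downtime a week, and 99.9% to roughly 10 minutes.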
Several useful lessons have been learnt:
1. Always load test a product to the level of load expected (and beyond) before it is used.
2. Always listen to user feedback – LUVLE was in use last year.
3. Don’t use a centralised service for mission-critical applications unless you can demonstrate that parts of it can fail safely.
4. Always have a backup.
5. Realise that students are now customers. Some pay directly. The rest bring in research grants.
6. It’s a miracle this hasn’t become a national news story…