Fault tolerance in large-scale computational grids

Grant number: 06/04976-9
Support Opportunities: Scholarships in Brazil - Post-Doctoral
Start date: January 01, 2007
End date: November 12, 2007
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computer Systems
Principal Investigator: Fabio Kon
Grantee: Fernando José Castor de Lima Filho
Host Institution: Instituto de Matemática e Estatística (IME). Universidade de São Paulo (USP). São Paulo, SP, Brazil

Abstract

Despite the fast evolution of Grid Computing in the last decade and its use to solve real, computationally expensive problems, many challenges remain to be overcome before this technology can be universally employed. One of these challenges is to guarantee that large-scale grids keep functioning efficiently when some of their nodes fail, in particular the nodes responsible for managing the grid infrastructure. Failures of such nodes can compromise the functioning of the whole grid. Some recent studies indicate that infrastructure problems are among the most common problems faced by users of computational grids. Usually, users are unable to solve such problems because: (i) they are specialists in the domains of the applications they submit to the grid, not in its management; and (ii) in large-scale grids, potentially comprising thousands of nodes, dozens of nodes can fail simultaneously, which makes manual management of the grid infeasible. This research project aims to investigate new protocols, mechanisms, and algorithms for the construction of a fault-tolerant and autonomic infrastructure for the execution of applications on large-scale computational grids.

Scientific publications
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
CASTOR FILHO, FERNANDO; ROMANOVSKY, ALEXANDER; RUBIRA, CECILIA MARY F.. Improving reliability of cooperative concurrent systems with exception flow analysis. JOURNAL OF SYSTEMS AND SOFTWARE, v. 82, n. 5, p. 874-890, . (06/04976-9)
CASTOR FILHO, FERNANDO; GARCIA, ALESSANDRO; RUBIRA, CECILIA MARY F.; IEEE. Extracting error handling to aspects: A cookbook. 2007 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE, v. N/A, p. 2-pg., . (06/04976-9)