The localization and internationalization of software has become a major activity for companies that have a significant presence in the international software market. In most cases they do an excellent job, usually taking a language-to-language translation approach to the problem: software is translated much as a book might be translated, and machine translation might be used to facilitate this. Of course software is more complex; only the text parts of the software need to be translated, and the translated text may occupy a different number of characters or require a different area on the screen or printed page.
This constrains the translation. But the text is about a narrow domain - the software itself - and that makes the translation easier.
The approach of GLOSSASOFT is more general. We view the problem as one of both software technology and linguistics. It is a linguistic problem because the translation requirement is not simply the substitution of one body of text by another. During execution several pieces of translated text may need to be brought together and composed, and the result must look natural to a native speaker. In developing international software we need to be able to indicate the required text in a neutral way (frequently this is done with a 'message number') and retrieve the corresponding text at run time. The neutral identifier represents the intended meaning. Producing the message at run time is then a problem of language generation, given the elements of meaning and the rules of composition.
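The idea of a neutral message number combined with language-specific rules of composition can be illustrated with a minimal Python sketch. It is not a GLOSSASOFT tool: the message number 101, the catalogue layout and the names `CATALOG`, `PLURAL_RULES` and `render` are all hypothetical, and the two plural rules are simplified for illustration.

```python
from typing import Callable, Dict

# Hypothetical plural rules: each maps a count to a category name.
def plural_en(n: int) -> str:
    # English: only exactly 1 takes the singular form.
    return "one" if n == 1 else "other"

def plural_fr(n: int) -> str:
    # French (simplified): 0 and 1 both take the singular form.
    return "one" if n in (0, 1) else "other"

PLURAL_RULES: Dict[str, Callable[[int], str]] = {"en": plural_en, "fr": plural_fr}

# Hypothetical catalogue: message number -> plural category -> template.
CATALOG: Dict[str, Dict[int, Dict[str, str]]] = {
    "en": {101: {"one": "{n} file was deleted.",
                 "other": "{n} files were deleted."}},
    "fr": {101: {"one": "{n} fichier a été supprimé.",
                 "other": "{n} fichiers ont été supprimés."}},
}

def render(lang: str, msg_no: int, n: int) -> str:
    """Generate the text for a message number at run time."""
    category = PLURAL_RULES[lang](n)          # rule of composition
    template = CATALOG[lang][msg_no][category]  # element of meaning
    return template.format(n=n)

print(render("en", 101, 0))
print(render("fr", 101, 0))
```

The calling code only ever mentions the message number and the count; the choice between singular and plural wording is made by the language-specific rule, which is why simple text substitution is not enough.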
It is also a linguistic problem because many software packages capture and manipulate text that has been supplied by the users. Examples of this are word processors and database management systems. In using these packages we are frequently required to match text, and what constitutes an acceptable match depends upon the language. We frequently ask for text to be sorted, and sort orders are language and culture specific. Software embeds such assumptions very deeply: for example, hashing algorithms will be constructed with the statistical properties of a particular corpus of words or names in mind.
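To make the point about matching and sorting concrete, the following Python sketch sorts and compares the same words under two different conventions. Both key functions are deliberately simplified assumptions written for this illustration (a Finnish-style alphabet in which å, ä and ö follow z, and a German-style folding in which ä is treated as a); real collation rules are considerably richer, and the names `finnish_key`, `german_key` and `matches` are hypothetical.

```python
from typing import Callable, List

# Simplified Finnish-style alphabet: å, ä, ö are separate letters after z.
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
FINNISH_RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word: str) -> List[int]:
    # Characters outside the alphabet sort after all known letters.
    return [FINNISH_RANK.get(ch, len(FINNISH_ALPHABET)) for ch in word.lower()]

# Simplified German-style folding: umlauts are treated as their base vowels.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def german_key(word: str) -> str:
    return word.lower().translate(GERMAN_FOLD)

def matches(a: str, b: str, key: Callable) -> bool:
    """Two strings match if they are equal under the given convention."""
    return key(a) == key(b)

words = ["zebra", "äiti", "apina"]
print(sorted(words, key=finnish_key))  # ä sorts after z
print(sorted(words, key=german_key))   # ä sorts with a
print(matches("äiti", "aiti", finnish_key), matches("äiti", "aiti", german_key))
```

The same data therefore yields a different order and a different notion of equality depending on which convention is plugged in, which is the kind of assumption that tends to be hard-wired into sorting, searching and indexing code.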
It is a software technology problem because we must be able to organize the software so that the linguistic components are isolated and can easily be replaced. This leads to the consideration of how standard software packages like window management systems, word-processors and database management systems are constructed - where the assumptions about a particular natural language and culture are embedded. We will have to propose extensions to current best practice so that the linguistic and cultural assumptions are factored out. We are also led into more general approaches to software construction, involving software reuse and componentry and the programming languages that are used to describe components and their interconnection. We will be considering reusable linguistic components that can be deployed as required in the construction of software packages.
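One way to picture linguistic and cultural assumptions being factored out into replaceable components is sketched below in Python. This is an illustration under our own assumptions rather than a design produced by the project: the interface `LocaleComponent`, the two implementations, the `REGISTRY` and the function `report_line` are hypothetical names, and the date formats shown are common conventions chosen only as examples.

```python
from datetime import date
from typing import Dict, Protocol

class LocaleComponent(Protocol):
    """Hypothetical interface that hides language and culture specifics."""
    def format_date(self, d: date) -> str: ...
    def yes_no(self, value: bool) -> str: ...

class IrishEnglish:
    def format_date(self, d: date) -> str:
        return f"{d.day:02d}/{d.month:02d}/{d.year}"  # e.g. 05/03/1994
    def yes_no(self, value: bool) -> str:
        return "Yes" if value else "No"

class Finnish:
    def format_date(self, d: date) -> str:
        return f"{d.day}.{d.month}.{d.year}"  # e.g. 5.3.1994
    def yes_no(self, value: bool) -> str:
        return "Kyllä" if value else "Ei"

# Components are registered separately from the application logic,
# so supporting a new locale means adding an entry here, not editing callers.
REGISTRY: Dict[str, LocaleComponent] = {
    "en-IE": IrishEnglish(),
    "fi-FI": Finnish(),
}

def report_line(component: LocaleComponent, d: date, ok: bool) -> str:
    """Application code: contains no language-specific text or formats."""
    return f"{component.format_date(d)}: {component.yes_no(ok)}"

print(report_line(REGISTRY["en-IE"], date(1994, 3, 5), True))
print(report_line(REGISTRY["fi-FI"], date(1994, 3, 5), True))
```

The application function depends only on the interface, so a component for another language can be substituted without touching it.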
The project plan is, then, to go back to basics, and review both software structures and linguistics from this perspective. By doing so we will extract general principles of internationalization and localization of software. We will evaluate these principles on a number of case studies in Ireland, Greece and Finland, and evolve and extend the guidelines we have developed. At the end of the project the guidelines will be published. We aim to go beyond current best practice, and make our findings widely available throughout the European Community.
The project has been split into four phases, each lasting six months. Phase one of the project consists of two parallel studies. The first concerns linguistics and Natural Language Processing technologies (Natural Language Analysis & Generation, Machine Translation Tools, Resources, Knowledge Representation Models). The second investigates current software and system architectures that are appropriate for software internationalization (existing types of systems, from user interfaces and window managers through database systems to spreadsheets and desktop publishing, as well as aspects of software reuse). Links have been made with relevant organizations, associations and projects (LISA, EAMT, ...) in order to improve and promote the project's results in the current and subsequent phases. A Glossasoft Interest Group has been formed, consisting of organizations that are particularly interested in research in this area and that wish to contribute to and share in the achievements of the consortium.
In the second phase a first version of the guidelines, methods and tool specifications was produced. In the case studies of the third phase these guidelines, methods and tool specifications were evaluated. During these case studies we outlined a few prototype tools to assist the localization process. There are three case studies on applications originally developed in English:
In the fourth and final phase we have finalized the development of the guidelines, methods and tool specifications, incorporating the experience gained in the case studies and adding deeper features of software internationalization and localization. At the end of this phase a dissemination program and seminars will be arranged in order to make the project's results widely available and to promote their adoption by software companies and relevant organizations and associations.