Word segmentation is the process of identifying word boundaries in a given text. It is often used as a preprocessing step in many NLP tasks such as linguistic annotation, speech recognition, information extraction and machine translation.
Although word segmentation has been relatively well researched in recent years, it remains a nontrivial problem. The difficulty arises because not all languages have an explicit and reliable word delimiter such as the space character, as English, German, French and other languages written in some form of the Latin or Cyrillic alphabet do. Languages such as Chinese, Japanese or Thai pose a particular problem for word segmentation, because their writing systems place no delimiter between words in written sentences.
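The contrast described above can be illustrated with a short Python sketch. It is not the method prescribed by ISO 24614. It only shows that splitting on whitespace works for an English sentence but returns a Chinese sentence as a single unsegmented string, and that a simple dictionary-based baseline (greedy forward maximum matching) can recover boundaries when a word list is available. The function names and the toy lexicon are hypothetical and chosen purely for illustration.

```python
def whitespace_segment(text):
    """Split on whitespace; adequate only when the script marks word boundaries."""
    return text.split()


def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a word list.

    At each position, take the longest lexicon entry that starts there;
    fall back to a single character when nothing in the lexicon matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}

english = "we like natural language processing"
chinese = "我们喜欢自然语言处理"

print(whitespace_segment(english))   # five separate words
print(whitespace_segment(chinese))   # one unsegmented string
print(forward_maximum_match(chinese, TOY_LEXICON))
```

Greedy matching of this kind depends entirely on the lexicon and on the longest-match heuristic, so it can choose the wrong boundary when several segmentations are possible. That ambiguity is one reason agreed rules for what counts as a segmentation unit are useful.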
The goal of the multi-part standard ISO 24614 is to standardize word segmentation for written languages whose word boundaries cannot be fully identified by typographic properties.
- Corpus Encoding Standard
- Darwin Information Typing Architecture
- Guidelines for Electronic Text Encoding and Interchange
- Journal Article Tag Suite
- Language Resources Management — Multilingual Information Framework
- Language resource management — Linguistic annotation framework
- NLM Journal Archiving and Interchange Tag Suite
- Segmentation Rules eXchange
The first part of the standard, ISO 24614-1:2010, describes the basic concepts and general principles of word segmentation. It offers language-independent requirements, recommendations and guidance for segmenting written texts into word segmentation units (WSUs) for monolingual and multilingual information processing. It also describes the basic framework for word segmentation, including its main characteristics, principles, components and processes.
The second part of the standard is language-specific and deals with Chinese, Japanese and Korean. It defines the word segmentation unit (WSU) for these languages, provides general rules for identifying WSUs that apply to all three, and specifies rules for identifying WSUs in each language individually.
In this way, the standard reviews the similarities and differences between the three languages, explains their implications for word segmentation, and sets them out in practical guidelines.
The third part of the standard covers South and Southeast Asian languages and specifies general rules for their word segmentation.