Language resource management — Word segmentation of written texts
Abbreviation: WordSeg [not official, only for reference in this website]
Scope: Standard for word segmentation of written texts
Topic: Generic Corpus Annotation, Segmentation
Standard body: ISO
Keywords: tokenization, word segmentation, word, segmentation, word segmentation unit
Description:

Word segmentation is the process of identifying word boundaries in a given text. It is often used as a preprocessing step for NLP tasks such as linguistic annotation, speech recognition, information extraction and machine translation.

Although word segmentation has been relatively well researched in recent years, it remains a nontrivial problem. The complexity arises because not all languages have clear and reliable criteria for a word delimiter, such as the space character, in the way that English, German, French and other languages written in some form of the Latin or Cyrillic alphabet do. Languages such as Chinese, Japanese or Thai pose a particular problem for word segmentation: their writing systems place no delimiter between words in written sentences.
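
The difference can be illustrated with a minimal Python sketch. The function name and both sample sentences are assumptions made for this illustration and are not taken from ISO 24614. Splitting on whitespace gives usable units for the English sentence but returns the whole Chinese sentence as a single unit.

```python
def whitespace_segment(text: str) -> list[str]:
    """Split on runs of whitespace; no linguistic knowledge is involved."""
    return text.split()


# Illustrative sample sentences (not taken from the standard).
english = "Word segmentation is a preprocessing step"
chinese = "我喜欢自然语言处理"  # "I like natural language processing"

# The English sentence is split at each space.
print(whitespace_segment(english))

# The Chinese sentence contains no spaces, so it comes back as one unit.
print(whitespace_segment(chinese))
```

Whitespace splitting is only a rough approximation even for English, since punctuation stays attached to neighbouring words.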

The goal of the multipart standard ISO 24614 is to standardize word segmentation for written languages whose word boundaries cannot be fully identified by typographic properties.

Part title: Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles
Abbreviation: WordSeg-1
Description:

The first part of the standard, ISO 24614-1:2010, describes the basic concepts and general principles of word segmentation. It offers language-independent requirements, recommendations and guidance for segmenting written texts into word segmentation units (WSU) for monolingual and multilingual information processing, and it describes the basic framework for word segmentation. This part sets out the essential characteristics, principles, components and processes of word segmentation in detail.

Abbreviation: WordSeg-1-2010 [not official, only for reference in this website]
Version Number: ISO 24614-1:2010
Status: International Standard
Release Date: 2010-10-25
Editor:
  1. ISO/TC 37/SC 4/WG 2 (since 2011, WG 6)
URL(s): http://www.iso.org
Related Standard(s):
  • DCR-2009

    should be used in close conjunction with

  • LMF-2012

    WordSeg-1 should be used in close conjunction with the LMF

  • MAF-2012

    should be used in close conjunction with

  • TMF-2009

    should be used in close conjunction with the TMF

Part title: Language resource management — Word segmentation of written texts — Part 2: Word segmentation for Chinese, Japanese and Korean
Abbreviation: WordSeg-2 [not official, only for reference in this website]
Description:

The second part of the standard is language-specific and deals with the Chinese, Japanese and Korean languages. It defines the word segmentation unit (WSU) for these languages, provides general rules for identifying WSU across all three, and specifies rules for identifying WSU in each language individually.

In this way, the standard reviews the similarities and differences between the three languages, explains their implications for word segmentation, and translates them into practical guidelines.
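
To give a concrete sense of what segmenting unspaced text involves, the sketch below uses forward maximum matching, a common dictionary-based baseline. It does not implement the rules of ISO 24614-2: the function name, the toy lexicon and the sample sentence are all assumptions made for this illustration.

```python
def forward_max_match(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a word list.

    Characters not covered by any lexicon entry are emitted as
    single-character units.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                i += size
                break
    return units


# Hypothetical toy lexicon; a real system would use a far larger one.
TOY_LEXICON = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

print(forward_max_match("我喜欢自然语言处理", TOY_LEXICON))
# → ['我', '喜欢', '自然语言', '处理']
```

Because the lexicon contains both "自然语言" and its parts "自然" and "语言", the result depends on the matching strategy. Deciding which of such alternatives counts as the unit is the kind of question that a definition of the word segmentation unit has to settle.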

Abbreviation: WordSeg-2-2011 [not official, only for reference in this website]
Version Number: ISO 24614-2:2011
Status: International Standard
Release Date: 2011-08-25
Editor:
  1. ISO/TC 37/SC 4/WG 6
URL(s): http://www.iso.org
Part title: Language resource management — Word segmentation of written texts — Part 3: Thai, Hindi, Vietnamese, and other related languages
Abbreviation: WordSeg-3 [not official, only for reference in this website]
Description:

The third part of the standard is intended to describe South and Southeast Asian languages and to specify general rules for their word segmentation.

Abbreviation: WordSeg-3-2004 [not official, only for reference in this website]
Version Number: ISO/PWI 24614-3
Status: Preliminary Work Item
Release Date: 2004-11-05
Editor:
  1. ISO/TC 37/SC 4/WG 6
URL(s): http://www.iso.org