Language resource management — Word segmentation of written texts

Abbreviation: WordSeg [not official, only for reference in this website]

Scope: Standard for word segmentation of written texts

Topic: Generic Corpus Annotation, Segmentation

Standard body: ISO

Keywords: tokenization, word segmentation, word, segmentation, word segmentation unit

Description:

Word segmentation is the process of identifying word boundaries in a given text. It is often used as a preprocessing step in NLP tasks such as linguistic annotation, speech recognition, information extraction and machine translation.

Although word segmentation has been relatively well researched in recent years, it remains a nontrivial problem. The complexity results from the fact that not all languages have clear and certain criteria for a word delimiter such as the space character, as English, German, French and other languages using some form of the Latin or Cyrillic alphabet do. Languages such as Chinese, Japanese or Thai, which are written without spaces between words, pose a particular problem for word segmentation: there is no word delimiter between words in written sentences.
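The difference can be illustrated with a minimal sketch in Python. It is not taken from ISO 24614 and does not implement its rules; it only contrasts whitespace splitting with a simple dictionary-based forward maximum matching heuristic. The function names, the `max_len` parameter and the toy lexicon are assumptions made for this illustration.

```python
def whitespace_tokens(text):
    """Split on whitespace; adequate only when spaces delimit words."""
    return text.split()


def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a (toy) lexicon.

    At each position the longest candidate found in the lexicon is taken;
    if none matches, a single character is emitted as a fallback.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical four-entry lexicon, for illustration only.
TOY_LEXICON = {"我们", "喜欢", "语言", "语言学"}

print(whitespace_tokens("We like linguistics"))        # three tokens
print(whitespace_tokens("我们喜欢语言学"))               # one unsegmented token
print(forward_max_match("我们喜欢语言学", TOY_LEXICON))  # three tokens
```

Whitespace splitting returns the whole Chinese sentence as a single token, whereas the lexicon-based heuristic recovers three units. Real segmenters have to deal with ambiguity and unknown words, which this sketch ignores.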

The goal of the multi-part standard ISO 24614 is to standardize word segmentation for written languages whose word boundaries cannot be fully identified by typographic properties.


Part title: Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles

Abbreviation: WordSeg-1

Description:

The first part of the standard, ISO 24614-1:2010, describes the basic concepts and general principles of word segmentation. It offers language-independent requirements, recommendations and guidance for segmenting written texts into word segmentation units (WSUs) for monolingual and multilingual information processing, and it describes the basic framework for word segmentation. The characteristics, principles, components and processes of word segmentation are described in detail in this part of the standard.

Abbreviation: WordSeg-1-2010 [not official, only for reference in this website]

Version Number: ISO 24614-1:2010

Status: International Standard

Release Date: 2010-10-25

Editor:

ISO/TC 37/SC 4/WG 2 (since 2011, WG 6)

URL(s): http://www.iso.org

Related Standard(s):

SpecDcr-ISDCR-2009: WordSeg-1 should be used in close conjunction with this standard
SpecLMF-ISLMF-2012: WordSeg-1 should be used in close conjunction with the LMF
SpecMaf-ISMAF-2012: WordSeg-1 should be used in close conjunction with this standard
SpecTMF-ISTMF-2009: WordSeg-1 should be used in close conjunction with the TMF

Part title: Language resource management — Word segmentation of written texts — Part 2: Word segmentation for Chinese, Japanese and Korean

Abbreviation: WordSeg-2 [not official, only for reference in this website]

Description:

The second part of the standard is language-specific and deals with Chinese, Japanese and Korean. It defines the word segmentation unit (WSU) for these languages, provides general rules for identifying WSUs across all three, and specifies additional rules for identifying WSUs in each language individually.

In this way, the standard reviews the similarities and differences between the three languages, explains their implications for word segmentation, and translates them into practical guidelines.

Abbreviation: WordSeg-2-2011 [not official, only for reference in this website]

Version Number: ISO 24614-2:2011

Status: International Standard

Release Date: 2011-08-25

Editor:

ISO/TC 37/SC 4/WG 6

URL(s): http://www.iso.org

Part title: Language resource management — Word segmentation of written texts — Part 3: Thai, Hindi, Vietnamese, and other related languages

Abbreviation: WordSeg-3 [not official, only for reference in this website]

Description:

The third part of the standard is intended to cover South and Southeast Asian languages such as Thai, Hindi and Vietnamese, and to specify general rules for their word segmentation.

Abbreviation: WordSeg-3-2004 [not official, only for reference in this website]

Version Number: ISO/PWI 24614-3

Status: Preliminary Work Item

Release Date: 2004-11-05

Editor:

ISO/TC 37/SC 4/WG 6

URL(s): http://www.iso.org

