  • BSI
    BS ISO 24614-1:2010 Language resource management. Word segmentation of written texts - Basic concepts and general principles
    Edition: 2010
    $411.69 / user per year

Description of BS ISO 24614-1:2010

This part of ISO 24614 presents the basic concepts and general principles of word segmentation, and provides language-independent guidelines to enable written texts to be segmented, in a reliable and reproducible manner, into word segmentation units (WSU).

NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical to have a universal definition of what comprises a word for the purposes of segmenting a text into words. One cannot simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and Japanese, and for agglutinative languages, where some functional word classes are realized as affixes, such as Korean.
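To make NOTE 1 concrete, here is a minimal Python sketch of the kind of space-and-punctuation rule it warns against. The function name naive_segment and both sample sentences are illustrative assumptions; nothing here is taken from ISO 24614 itself.

```python
def naive_segment(text):
    """Deliberately naive rule: split on whitespace, strip edge punctuation."""
    tokens = []
    for chunk in text.split():
        cleaned = chunk.strip(".,;:!?\"'()")
        if cleaned:
            tokens.append(cleaned)
    return tokens


# Hypothetical sample sentences, chosen to trigger the problems NOTE 1 lists.
english = "The state-of-the-art U.S. system costs $3.50."
chinese = "我喜欢学习语言"  # "I like learning languages", written without spaces

english_tokens = naive_segment(english)
chinese_tokens = naive_segment(chinese)

print(english_tokens)
# ['The', 'state-of-the-art', 'U.S', 'system', 'costs', '$3.50']
print(len(chinese_tokens))
# 1
```

The rule offers no principled treatment of the hyphenated compound, damages the abbreviation by dropping its final period, and returns the entire Chinese sentence as a single unit because it contains no spaces.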

The many applications and fields that need to segment texts into words, and to which this part of ISO 24614 can therefore be applied, include the following.

Translation

Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard function in translation memory systems and computer-assisted translation (CAT) tools. Word segmentation is performed by term extraction tools, which are sometimes provided in terminology management systems and CAT tools.

Content management

Most content management systems and databases allow for searching by individual words. The content being searched has to be segmented to permit matching with a search word. Furthermore, search functions require knowledge of the boundaries of words.
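As a rough illustration of why searchable content has to be segmented first, the Python sketch below builds a small inverted index from segmented text. The segment function is a hypothetical placeholder (lowercasing and whitespace splitting), and the documents are made-up samples; a real system would substitute its own segmentation method.

```python
from collections import defaultdict


def segment(text):
    """Placeholder segmenter (an assumption): whitespace split, edge punctuation removed."""
    words = []
    for chunk in text.split():
        cleaned = chunk.strip(".,;:!?").lower()
        if cleaned:
            words.append(cleaned)
    return words


def build_index(documents):
    """Map each segmented word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in segment(text):
            index[word].add(doc_id)
    return index


# Hypothetical sample content.
documents = {
    "doc1": "Word segmentation is a standard function.",
    "doc2": "Search functions require word boundaries.",
}

index = build_index(documents)
hits_word = sorted(index.get("word", set()))
hits_function = sorted(index.get("function", set()))

print(hits_word)      # ['doc1', 'doc2']
print(hits_function)  # ['doc1']
```

A search word matches only where the segmenter produced an identical unit, so "function" does not find the document that contains "functions". The position of word boundaries directly determines what a search can return.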

Speech technologies

Text-to-speech systems generate speech based on words and therefore require word segmentation for lexicon lookup, stress assignment, prosodic pattern assignment, etc.

Computational linguistics

Various natural language processing (NLP) systems must segment text into words in order to carry out their functions. NLP systems include

  • morphosyntactic processors,

  • syntactic parsers,

  • spellcheckers,

  • text classification systems, and

  • corpus linguistics annotators.

Lexicography

Lexical resources are often evaluated by size, usually by referring to the number of words.

NOTE 2 The size of language resources is an essential benchmark for their management. Quantifying the size of language resources is typically achieved by counting the words. However, because NLP applications use different segmentation methods, each calculates the number of words differently and arrives at a different sum for the same text. A reliable, reproducible, standard measure would allow comparable results. This is not to say that applications may not use their own, application-specific segmentation methods. For example, a speech synthesis application might segment a text into smaller or larger units compared to another application.
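The effect described in NOTE 2 can be reproduced with a short Python sketch. Both counting functions and the sample sentence are illustrative assumptions, not methods defined by the standard; they simply stand in for two applications that segment differently.

```python
import re

# Hypothetical sample text containing hyphenated compounds and a contraction.
text = "The state-of-the-art text-to-speech system isn't rule-based."


def count_whitespace(text):
    """Method A: every whitespace-delimited chunk counts as one word."""
    return len(text.split())


def count_alnum_runs(text):
    """Method B: every run of ASCII letters or digits counts as one word."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


count_a = count_whitespace(text)
count_b = count_alnum_runs(text)

print(count_a)  # 6
print(count_b)  # 13
```

The same sentence is counted as 6 words by one method and 13 by the other, which is why sizes reported by different tools are not comparable unless they share a segmentation convention.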



About BSI

BSI Group, also known as the British Standards Institution, is the national standards body of the United Kingdom. BSI produces technical standards on a wide range of products and services and also supplies certification and standards-related services to businesses.
