Watson Information Encoding [closed]

3

IBM Watson contains a lot of information about books, encoded in a "database" that Watson searches in real time. Does anyone know how this information is encoded? It is hard to imagine humans typing in all of these rules by hand.

    
asked by durron597, 01.04.2011 - 19:23

1 answer

8

The heart of Watson is IBM's DeepQA software. Some answers can be found in the DeepQA FAQ:

Q: What data is stored in Watson?

A: All of Watson's data will be self-contained. Watson will perform without a connection to the web or any external resource. The vast majority of Watson's data will be a wide variety of natural language text. Some structured (formal knowledgebase's) and semi-structured data (tagged text) is also included mostly to help interpret text and refine answers. Exactly which data will be used for competing on Jeopardy! will be revealed at a later date, but the specific content and how to analyze and manage it are part of the research agenda

Q: Does DeepQA use UIMA?

A: Yes. UIMA is a standard framework for building applications that perform deep analysis on unstructured content, including natural language text, speech, images and video. IBM contributed UIMA to open-source (see the Apache UIMA web site) to help facilitate and accelerate work in deep content analytics. UIMA is also now an OASIS standard. UIMA-AS implements UIMA on asynchronous messaging middleware. DeepQA and the Watson system uses UIMA-AS as its principal infrastructure for assembling, scaling-out and deploying all its analytic components.

UIMA is probably the key. From the Apache UIMA description:

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
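To make the component idea concrete, here is a minimal sketch of such a pipeline in plain Python. It does not use the UIMA API: `Document`, `Annotation`, `GAZETTEER`, and the three component functions are hypothetical names invented for illustration, and the heuristics are toys. The point is only the shape: each component reads a shared document structure and adds annotations that later components can build on.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str        # "language", "sentence" or "entity"
    begin: int       # start offset into the document text
    end: int         # end offset (exclusive)
    value: str = ""  # label attached to the span, if any


@dataclass
class Document:
    """Shared structure passed between components (a loose stand-in for a CAS)."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]


# Hypothetical lookup table; a real entity detector would be far richer.
GAZETTEER = {"Alice": "person", "Acme": "organization", "Paris": "place"}


def identify_language(doc):
    # Toy heuristic: any common English stop word means "en".
    words = re.findall(r"[a-z]+", doc.text.lower())
    hits = sum(w in {"the", "is", "in", "of", "and"} for w in words)
    lang = "en" if hits > 0 else "unknown"
    doc.annotations.append(Annotation("language", 0, len(doc.text), lang))


def detect_sentences(doc):
    # Toy rule: a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"[^.!?]+[.!?]?", doc.text):
        raw = m.group()
        stripped = raw.strip()
        if not stripped:
            continue
        begin = m.start() + (len(raw) - len(raw.lstrip()))
        doc.annotations.append(Annotation("sentence", begin, begin + len(stripped)))


def detect_entities(doc):
    # Builds on the sentence annotations produced by the previous component.
    for sentence in doc.select("sentence"):
        span = doc.text[sentence.begin:sentence.end]
        for m in re.finditer(r"\b\w+\b", span):
            label = GAZETTEER.get(m.group())
            if label:
                begin = sentence.begin + m.start()
                doc.annotations.append(Annotation("entity", begin, begin + len(m.group()), label))


PIPELINE = [identify_language, detect_sentences, detect_entities]


def run_pipeline(text, components=PIPELINE):
    doc = Document(text)
    for component in components:
        component(doc)  # each component only adds annotations
    return doc
```

Calling `run_pipeline("Alice works for Acme. Acme is located in Paris.")` returns a `Document` whose `select("entity")` annotations point back into the original text by offset, so the text itself is never modified.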

This Apache press release has some more information:

Hundreds of Apache UIMA Annotators and thousands of algorithms help Watson –which runs disconnected from the Internet– access vast databases to simultaneously comprehend clues and formulate answers. Watson then analyzes 500 gigabytes of preprocessed information to match potential meanings for the question and a potential answer to the question. Helping Watson do this is:

  1. Apache UIMA: standards-based frameworks, infrastructure and components that facilitate the analysis and annotation of an array of unstructured content (such as text, audio and video). Watson uses Apache UIMA for real-time content analytics and natural language processing, to comprehend clues, find possible answers, gather supporting evidence, score each answer, compute its confidence in each answer, and improve contextual understanding (machine learning) – all under 3 seconds.

  2. Apache Hadoop: software framework that enables data-intensive distributed applications to work with thousands of nodes and petabytes of data. A foundation of Cloud computing, Apache Hadoop enables Watson to access, sort, and process data in a massively parallel system (90+ server cluster/2,880 processor cores/16 terabytes of RAM/4 terabytes of disk storage).

The Watson system uses UIMA as its principal infrastructure for component interoperability and makes extensive use of the UIMA-AS scale-out capabilities that can exploit modern, highly parallel hardware architectures. UIMA manages all work flow and communication between processes, which are spread across the cluster. Apache Hadoop manages the task of preprocessing Watson's enormous information sources by deploying UIMA pipelines as Hadoop mappers, running UIMA analytics.
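The idea of running an analysis pipeline inside a mapper can be sketched in plain Python, without the Hadoop or UIMA APIs. Everything below is an assumption made for illustration: `analyze` stands in for a real pipeline, `KNOWN_TERMS` is a made-up vocabulary, and the reduce step that builds a term index is my own addition, since the press release only mentions mappers.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary the stand-in "analytics" looks for.
KNOWN_TERMS = {"watson", "uima", "hadoop"}


def analyze(text):
    """Stand-in for an analysis pipeline: return the known terms found in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w in KNOWN_TERMS})


def mapper(doc_id, text):
    # One call per input document; emits (term, doc_id) pairs.
    for term in analyze(text):
        yield term, doc_id


def reducer(term, doc_ids):
    # Assumed reduce step: collect the documents that mention each term.
    return term, sorted(set(doc_ids))


def run_job(corpus):
    # Sequential simulation of map, shuffle and reduce; a real cluster
    # would run many mapper calls in parallel on different nodes.
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for key, value in mapper(doc_id, text):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))
```

Because each `mapper` call touches only its own document, the calls are independent of one another, which is what lets this kind of preprocessing spread across a large cluster.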

Pretty interesting :)

    
answered 30.06.2011 - 16:39