Wednesday, 28 May 2014

Welcome to the IBM Text Analytics Catalog



The Text Analytics Catalog is an indispensable tool that lets you easily extend the text analytic capabilities of IBM Watson Content Analytics. With hundreds of text analytics to choose from (and the number still growing), the Text Analytics Catalog jump-starts your content analysis task. Be sure to click the "Getting Started" tab to learn about some optional tools that will make the catalog easier to use. Once you have completed the Getting Started section, you are ready to go.
There are several ways to find the text analytics you need. The table of contents below is organized by domain and functional category. Click a category to view the list of text analytics specific to that category. You can view the complete list of catalog contents by clicking the "Catalog Entries" tab. Lastly, you can use the search feature in the banner above to find the exact text analytics you are looking for.


Table of Contents
============

Academia
Text analytics related to academics (universities, degrees, etc.)
Anatomy
Text analytics for the human body
Animals
Text analytics for the animal kingdom
Astronomy 
Text analytics to identify astronomical features
Automotive
Automotive industry analytics
Aviation
Aviation industry analytics
Bio Organisms
Text analytics for the study of bio organisms
Bio Chemistry
Bio Chemistry analytics
Chemistry
Text analytics for inorganic chemistry
Computers
Computer related text analytics
Consumer Goods
Consumer related text analytics
Entertainment
Entertainment related text analytics
Finance
Text analytics related to the finance industry
General
General purpose analytics, to include...
Books
Book metadata extractors (e.g., ISBN, publication date, etc.) and popular literature
Date and Time
Date and time analytics supporting a variety of formats
Measurements
Measurement related analytics (e.g., distance, weight, speed, etc.)
Money
U.S. and foreign currency analytics (e.g., amounts, conversions, etc.)
Numbers
U.S. and foreign numeric analytics
Phone
U.S. and foreign phone number analytics in varying formats (e.g., long distance, international, local exchanges, etc.)
Document
General analytics assisting in the parsing of a document (e.g., zoning, etc.)
Geography
Geography related analytics
Features
Identifies geographic features such as rivers, seas, gulfs, islands, lakes, channels, capes, etc.
Cities
U.S. and foreign city identifiers
Locations
Specific locations of interest
Roads
Road identifiers including interstates, parkways and street names
Healthcare
Healthcare related analytics (drugs, medicines, diseases, etc.)
Linguistics
Language related analytics
Law Enforcement
Law enforcement related analytics
Legal
Legal related analytics
Metals
Dictionaries on various kinds of metals and by-products
Military
Military related analytics
Mineralogy
Analytics for the study of rocks and minerals
Miscellaneous
Miscellaneous text analytics that are not easily classified
Organizations
Organizations of all kinds, including political, religious, international, etc.
People
Person related analytics
Plants
Text analytics for the identification of trees, plants, and shrubs
Telecommunications
Text analytics for the telecommunications industry (e.g., call center analytics)
Tools
Tools and utilities for building and debugging your text analytics

Friday, 23 May 2014

Real-time NLP API

The real-time natural language processing (NLP) API allows users to perform ad-hoc text analytics on documents.
Real-time text analysis uses the existing text analytics resources that are defined for a collection, but analyzes documents without adding them to the index. Users can immediately check the analysis results without waiting for the index to be built or updated.

Requirements

The following system setup is required to use the real-time NLP API:
  • Real-time NLP requires a text analytics collection that hosts text analytics resources.
  • Administrators configure the collection for real-time NLP by configuring the facet tree, dictionaries, and patterns for text extraction, just as they would for typical text analytics collections. The result of real-time NLP reflects the configuration of that collection.
  • The parse and index sessions for the collection must be running because these sessions provide the document processing engine for the real-time NLP API.
  • Search sessions for the collection must be running because these sessions serve as the gateway for the real-time NLP API.

Sample plug-in application for non-web crawlers

The sample crawler plug-in application shows how you can change security token values, metadata, and the content of crawled documents.
package sample;

import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.List;

import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.Content;
import com.ibm.es.crawler.plugin.CrawledData;
import com.ibm.es.crawler.plugin.CrawlerPluginException;
import com.ibm.es.crawler.plugin.FieldMetadata;

/**
 * The <code>MyCrawlerPlugin</code> is a sample crawler plugin module.
 */
public class MyCrawlerPlugin extends AbstractCrawlerPlugin {

   
   /**
    * Default constructor.
    */
   public MyCrawlerPlugin() {
      super();
   }

   /**
    * Initialize this object.
    * 
    * This sample program has nothing in this method.
    * 
    * @see com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#init()
    */
   public void init() throws CrawlerPluginException {

      /*
       * [Tips]
       * If your crawler plugin module needs to do any initialization,
       * add the code here.
       * [Example]
       * Get a JDBC connection for your local system.
       * connection = DriverManager.getConnection("jdbc:db2:xxxx");
       */

   }

   /**
    * Returns the Boolean value for metadata usage.
    * 
    * This sample program returns <code>true</code>.
    *  
    * @see com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#isMetadataUsed()
    */
   public boolean isMetadataUsed() {

      /*
       * [Tips]
       * If your crawler plugin module updates both metadata and security
       * tokens, return true.
       * If your crawler plugin module updates security tokens only,
       * return false.
       */
      return true;
   }

   /**
    * Terminate this object.
    * 
    * This sample program has nothing in this method.
    * 
    * @see com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#term()
    */
   public void term() throws CrawlerPluginException {

      /*
       * [Tips]
       * If your crawler plugin module needs to do any cleanup
       * for termination, add the code here.
       * [Example]
       * Close the JDBC connection for your local system.
       * connection.close();
       */

      return;

   }

   /**
    * Update crawled data.
    * 
    * This sample program updates the security tokens, metadata, language,
    * and content.
    * 
    * @see com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#updateDocument(com.ibm.es.crawler.plugin.CrawledData)
    */
   public CrawledData updateDocument(CrawledData crawledData) 
   throws CrawlerPluginException {

      // Get uri string, security tokens, and field metadata
      String url = crawledData.getURI();
      String securityTokens = crawledData.getSecurityTokens();
      List metadataList = crawledData.getMetadataList();
      if (metadataList == null) {
         metadataList = new ArrayList();
      }

      /*
       * [Tips]
       * If your crawler plugin module rejects some crawled data,
       * add the check code here and return null.
       */
      // This sample always returns updated document.
      if (false) {
         return null;
      }

      /*
       * [Tips]
       * If your crawler plugin module updates the security tokens,
       * add the code here.
       */
      // update security token (for sample)
      String newToken = "SampleToken";
      String newSecurityTokens = securityTokens + "," + newToken;
      crawledData.setSecurityTokens(newSecurityTokens);

      /*
       * [Tips]
       * If your crawler plugin module updates metadata,
       * add the code here.
       */
      // update metadata (for sample)
      FieldMetadata newFieldMetaData = new FieldMetadata("copyright", "IBM");
      metadataList.add(newFieldMetaData);
      crawledData.setMetadataList(metadataList);
      
      
      /*
       * Set language. 
       */
      crawledData.setLanguage("en");
      crawledData.setLanguageAutoDetection(true);
      
      /*
       * Update the content (available since 8.3).
       */
      Content content = crawledData.getOriginalContent();
      
      java.io.InputStream in = null;
      
      try {
         // If the original crawled content is null, create new content.
         if (content == null) {
            content = crawledData.createNewContent();
         } else {
            // If the original crawled content exists, get an InputStream
            // object to access it.
            in = content.getInputStream();

            // read the content

            in.close();
         }
      } catch (IOException ioe) {
         throw new CrawlerPluginException(ioe);
      }
      
      // set information against the content.
      content.setCodepage("UTF-8");
      content.setCodepageAutoDetection(true);
      content.setMimeType("text/plain");

      // Overwrite the content.
      try{
         
         OutputStream outputStream = content.getOutputStream();

         // write content to OutputStream
         String newText = "The new content of plain text ";
         BufferedWriter br = new BufferedWriter(new OutputStreamWriter
         (outputStream, "UTF-8"));
         br.write(newText);
         br.flush();
         br.close();
         
      }catch(IOException ioe){
         throw new CrawlerPluginException(ioe);
      }
      
      // Submit change for the content.
      crawledData.submitContent(content);
      
      return crawledData;
   }
 
   /* (non-Javadoc)
    * @see com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#isContentUsed()
    */
   public boolean isContentUsed() {
      return true;
   }

}
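
The sample above appends a new token to the existing security tokens with a plain comma. Whether getSecurityTokens() can return null or an empty string is not stated here, so the following is only a defensive sketch under that assumption. SecurityTokenHelper and appendToken are hypothetical names and are not part of the IBM plug-in API.

```java
// Hypothetical helper, not part of com.ibm.es.crawler.plugin.
final class SecurityTokenHelper {

    private SecurityTokenHelper() {
    }

    /**
     * Appends newToken to a comma-separated token string.
     * Assumption: a null or blank existing value means "no tokens yet",
     * so the new token is returned on its own instead of "null,SampleToken".
     */
    static String appendToken(String existingTokens, String newToken) {
        if (existingTokens == null || existingTokens.trim().isEmpty()) {
            return newToken;
        }
        return existingTokens + "," + newToken;
    }
}
```

Inside updateDocument, such a helper could replace the direct string concatenation before the new value is set on the crawled data.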

Sentiment Analysis

To analyze the sentiment of some text, do an HTTP POST to http://text-processing.com/api/sentiment/ with form-encoded data containing the text you want to analyze. You'll get back a JSON object response with 2 attributes:
label: will be either pos if the text is determined to be positive, neg if the text is negative, or neutral if the text is neither pos nor neg.
probability: an object that contains the probability for each label. neg and pos will add up to 1, while neutral is standalone. If neutral is greater than 0.5 then the label will be neutral. Otherwise, the label will be pos or neg, whichever has the greater probability.
Here are some examples using curl:
$ curl -d "text=great" http://text-processing.com/api/sentiment/
{
        "probability": {
                "neg": 0.39680315784838732,
                "neutral": 0.28207586364297021,
                "pos": 0.60319684215161262
        },
        "label": "pos"
}

$ curl -d "text=terrible" http://text-processing.com/api/sentiment/
{
        "probability": {
                "neg": 0.68846305481785608,
                "neutral": 0.38637609994709854,
                "pos": 0.31153694518214375
        },
        "label": "neg"
}

$ curl -d "text=hi friend" http://text-processing.com/api/sentiment/
{
        "probability": {
                "neg": 0.59797768649386562,
                "neutral": 0.74939503025120124,
                "pos": 0.40202231350613421
        },
        "label": "neutral"
}
You can also get sentiment for the Dutch language:
$ curl -d "language=dutch&text=goed boek" http://text-processing.com/api/sentiment/
{
        "probability": {
                "neg": 0.22499999999999998,
                "neutral": 0.099999999999999978,
                "pos": 0.77500000000000002
        },
        "label": "pos"
}
Try the sentiment analysis demo to get a feel for the results.
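
The labeling rule described above (neutral wins when it is greater than 0.5, otherwise the larger of pos and neg) can be sketched in a few lines of Java. This only illustrates the stated rule and is not the service's actual implementation. SentimentLabel and decide are hypothetical names, and the tie-break when pos equals neg is an assumption, since the description does not cover that case.

```java
// Hypothetical helper that mirrors the documented labeling rule.
final class SentimentLabel {

    private SentimentLabel() {
    }

    /** Returns "neutral", "pos", or "neg" for the given probabilities. */
    static String decide(double neg, double neutral, double pos) {
        if (neutral > 0.5) {
            return "neutral";
        }
        // Assumption: an exact tie between pos and neg resolves to "pos".
        return pos >= neg ? "pos" : "neg";
    }

    public static void main(String[] args) {
        // Probabilities copied from the curl responses for "great",
        // "terrible", and "hi friend".
        System.out.println(decide(0.39680315784838732, 0.28207586364297021, 0.60319684215161262)); // pos
        System.out.println(decide(0.68846305481785608, 0.38637609994709854, 0.31153694518214375)); // neg
        System.out.println(decide(0.59797768649386562, 0.74939503025120124, 0.40202231350613421)); // neutral
    }
}
```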

Parameters

text: Required - the text you want to analyze. It must not exceed 80,000 characters.
language: The default language is english, but this API also supports dutch and french.
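
A client could check these parameters locally before sending anything. The sketch below builds the form-encoded body and a POST request with the JDK's java.net.http API (Java 11 or later). SentimentRequest, formBody, and buildRequest are hypothetical names. Counting characters with String.length() is an assumption, because how the service counts the 80,000-character limit is not stated here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper; names are illustrative only.
final class SentimentRequest {

    static final String ENDPOINT = "http://text-processing.com/api/sentiment/";
    static final int MAX_TEXT_LENGTH = 80_000;

    private SentimentRequest() {
    }

    /** Builds the form-encoded body; language may be null to use the default. */
    static String formBody(String text, String language) {
        if (text == null || text.isEmpty()) {
            throw new IllegalArgumentException("text is required");
        }
        // Assumption: the limit is compared against String.length().
        if (text.length() > MAX_TEXT_LENGTH) {
            throw new IllegalArgumentException("text exceeds 80,000 characters");
        }
        StringBuilder body = new StringBuilder();
        if (language != null) {
            body.append("language=")
                .append(URLEncoder.encode(language, StandardCharsets.UTF_8))
                .append('&');
        }
        body.append("text=").append(URLEncoder.encode(text, StandardCharsets.UTF_8));
        return body.toString();
    }

    /** Wraps the body in a POST request, the same shape that curl -d sends. */
    static HttpRequest buildRequest(String text, String language) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(text, language)))
                .build();
    }
}
```

The resulting request can be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) to obtain the JSON response as a string.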

Return Value

On success, a 200 OK response will be returned containing a JSON object that looks like this:
{
        "label": "pos",
        "probability": {
                "pos": 0.85,
                "neg": 0.15,
                "neutral": 0.4
        }
}

Errors

A 400 Bad Request response will be returned under the following conditions:
  • no value for text is provided
  • text exceeds 80,000 characters
A 503 Throttled response will be returned if you exceed the daily request limit. Sign up for the Mashape Text-Processing API to get a higher limit plan.
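
The status codes above map naturally onto a small set of outcomes that calling code can branch on. The following is a minimal sketch. SentimentStatus, Outcome, and classify are hypothetical names, and treating every other status code as UNEXPECTED is an assumption rather than documented behavior.

```java
// Hypothetical mapping of the documented HTTP status codes to outcomes.
final class SentimentStatus {

    enum Outcome { OK, BAD_REQUEST, THROTTLED, UNEXPECTED }

    private SentimentStatus() {
    }

    static Outcome classify(int statusCode) {
        switch (statusCode) {
            case 200:
                return Outcome.OK;          // JSON body with label and probability
            case 400:
                return Outcome.BAD_REQUEST; // text missing or longer than 80,000 characters
            case 503:
                return Outcome.THROTTLED;   // daily request limit exceeded
            default:
                return Outcome.UNEXPECTED;  // assumption: anything else is not covered here
        }
    }
}
```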

Saturday, 17 May 2014

IBM Text Analytics, Turf Xumo, Pingar, IDOL Server

The Information Management and Analytics group at IBM Research is seeking applications for Researcher and fixed-term Postdoc positions from fresh and experienced researchers with a PhD in Computer Science from a reputed institution in the areas of database systems, information retrieval, distributed computing, information integration, business intelligence, data/text mining, and big data platforms. The candidate should have research orientation and a proven track record in identifying and solving research challenges specific to the area of research. The candidate should also have a strong record of publications in leading conferences and journals.

Our current areas of interest include:
  • Managing uncertain data at scale, including issues relating to data quality and analytics over uncertain enterprise, web, sensor and human-generated data
  • Data fusion, ETL, information integration, entity resolution, and analytics over structured and unstructured multimodal data
  • Systems, frameworks, and techniques for scalable information extraction, indexing, and search over massive volumes of unstructured and semi-structured data, including machine generated and log data
  • All aspects of big data including discovery, curation, governance and analytics leveraging scale-out platforms such as Hadoop and NoSQL databases
  • Advanced business intelligence, data mining and predictive analytics for various domains such as marketing, banking, etc.
  • Spatio-temporal and geo-spatial data analysis
  • Graph analytics at scale on platforms like TinkerPop and Titan
  • Integration of graphs with structured (RDBMS) or unstructured text documents
  • Data privacy and security in the context of cloud based data services
  • Core database technologies

Candidates are expected to generate novel ideas as well as invent or design complex products and processes; engineer these ideas to an advanced state of feasibility by evaluating them and participating in their implementation; connect to other business units and customers to identify and understand business problems and pain points; and represent IBM at professional forums and in professional societies.

The Big Problem

Preparing data is painful

Big Data relevant for analysis exists in many formats (structured, semi-structured and unstructured), across various systems (private and public), and refreshes at varying frequencies (static and streaming). It takes many person-weeks of effort just to get the data into a unified, analyzable format.

Searching for relevant data is expensive

With large and varying data sets, it is impossible to know comprehensively what data exists where. Searching for and selecting the right features for analysis is an iterative, time-consuming and clumsy exercise.

Analytical models are complex and expensive to build

Specialized statistical and Machine Learning models need to be scripted by specialists before they can be applied to data. The cost in time and talent is so high that many problems just get a BI treatment. And BI on Big Data is just not good enough.

Analytical models are not effective for long

Orthodox approaches do not self-learn when data is refreshed. The cost of maintaining and refreshing analytical models gets alarmingly high.

Result

The cost of discovering and solving use cases is exorbitant! It takes months of plodding before a use-case can be finally solved to meet the exacting requirements of business.