Saturday, 28 June 2014
Thursday, 12 June 2014
Companies Offering Text Analytics Solutions
Attensity
The technology from Attensity allows users to extract and analyze facts like who, what, where, why, under what conditions and to whom, as well as opinions and events found in unstructured data. It then creates output that is fused with existing structured data so that it can be analyzed using Attensity's applications or by using business intelligence applications already installed in the client organisation. Attensity’s text analytics suite comprises Attensity Discover, Attensity Analytics, Attensity Server, Attensity Workstation, Attensity Integration, and Attensity Knowledge Engineering.
ClearForest
(“Text-Driven Business Intelligence”) The ClearForest Text Analytics solution consists of a tagging and extraction platform, an analytics platform, and a development environment. The platforms can be used "out of the box" or as customised add-on analytics and extraction. ClearForest is now part of the Reuters group.
Content Analyst®
("Stop searching. Start doing™") The core Content Analyst technology is Conceptual Comparison, which automatically locates text (ranging from a single word to an entire book) within a conceptual representation space and so can provide a measure of the similarity of any two items represented in that space. Other related technologies include Blind Search; Categorization: automatic or exemplar-based categorization; Contextual Explanation: when encountering an unfamiliar word, a pop-up lists ten similar terms, and it can also present terms associated with a particular document; Relationship Discovery: the discovery of subtle relationships between entities based on implicit patterns on a SQL-like query system. A demo can be scheduled.
Inxight
(“Transforming Text into Actionable Information”) Based in Sunnyvale, the company spun off from Xerox PARC in 1997. It is involved with extraction and visualization technologies in a variety of languages. The company's products allow “customers to access, cluster, and be alerted to relevant information contained in the open Web, deep Web (patent databases, SEC filings), subscription, and internal sources.” They allow the annotation of documents for named entities (persons, companies, places, weapons, addresses, and dates), events (M&A activity and travel activities), and relationships among entities. Entity extraction can also be customized, and visualization packages include StarTree®, TableLens™, and TimeWall.
SPSS
SPSS solutions combine the LexiQuest text mining products with the data mining workbench Clementine to reveal concepts in text, which can then be combined with other data to create predictive models. The SPSS system is referred to as Predictive Text Analytics™. Products include LexiQuest Categorize™, LexiQuest Mine™, and SPSS Text Analysis for Surveys™.
Lexalytics
Lexalytics provides software for entity extraction, sentiment analysis, document summarization and thematic extraction for today's businesses.
Texifter
Texifter "improves efficiency by streamlining the process of sorting large amounts of unstructured text. Texifter offers off-the-shelf enterprise class business applications specifically developed to meet the complex needs of researchers and federal rule writers. Texifter utilizes SaaS & cloud-based solutions for topic modeling, duplicate detection, and other information retrieval tasks involving users in an active learning loop."
Other Companies
Apollo Data Technologies
Autonomy
Basis Technology
Convera
FAST
Janya
Megaputer
Notiora
Nstein
Skitegic
Temis
Teragram
Thursday, 5 June 2014
What is Amazon Cloud?
Hello, my name is Ravi Shankar Gupta, and I'm here today to talk to
you about Amazon Cloud. What is Amazon Cloud? Well, the proper
name for that is Amazon Web Services, or AWS. AWS is a public
cloud. It's an offering that can best be described as
Infrastructure as a Service, which means it allows the general
public to procure or rent resources from AWS for compute tasks,
storage, and networking.
You pay for these resources only when you have used a resource
of some kind, be that compute or storage. So you're only paying
for what you actually use and only when you use it. That is
referred to as consumption-based pricing.
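The arithmetic behind consumption-based pricing can be sketched in a few lines. This is only an illustration: the hourly rate below is a hypothetical number, not an AWS price, and the function name is made up for this example.

```python
def consumption_cost(hours_used: float, hourly_rate: float) -> float:
    """Cost when you pay only for the hours a resource actually ran."""
    if hours_used < 0 or hourly_rate < 0:
        raise ValueError("hours_used and hourly_rate must be non-negative")
    return hours_used * hourly_rate


# Hypothetical rate in dollars per instance-hour (assumption, not a quote).
HYPOTHETICAL_RATE = 0.10

# An instance that runs 8 hours a day for 20 days is billed for 160 hours,
# while an instance left running for a 30-day month is billed for 720 hours.
part_time = consumption_cost(8 * 20, HYPOTHETICAL_RATE)
always_on = consumption_cost(24 * 30, HYPOTHETICAL_RATE)
print(f"part-time: ${part_time:.2f}, always-on: ${always_on:.2f}")
```

The point of the comparison is that shutting resources down when they are idle directly lowers the bill.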
Just as you would deploy resources in your own data center, in
particular locations, in the Amazon EC2 Cloud, which stands for
Elastic Compute Cloud, you would be deploying resources in one of
the data centers that are provisioned, built, staffed, and
operated by the AWS team.
The AWS team does operate in six different regions around the
world. One of them is called the US-East and is located in
Virginia. There are two regions in the US-West: one is in
Northern California, and the other one is in Oregon.
There is also a region in Europe, based in Ireland, as well as
two regions in the Asia-Pacific, one in Singapore and one in
Tokyo.
There is also a special region called GovCloud, which is designed
exclusively for use by United States government agencies at the
federal, state, and municipal levels. Different regions have
multiple availability zones. Availability zones are designed to
protect you in the case of failure, if you choose to deploy your
applications and your infrastructure across multiple availability
zones.
Amazon designs these in such a way that they are powered using
separate power grids, are located in different flood plains, and
have different connectivity to the Internet, which allows you to
survive a failure of one availability zone by relocating your
application to a different availability zone.
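The idea of spreading a deployment across availability zones can be sketched without any AWS API at all. The zone labels and instance names below are hypothetical, and round-robin placement is just one simple policy chosen for illustration.

```python
from collections import defaultdict
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to availability zones in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    placement = defaultdict(list)
    for name, zone in zip(instance_names, cycle(zones)):
        placement[zone].append(name)
    return dict(placement)


def survivors(placement, failed_zone):
    """Return the instances that are still running after one zone fails."""
    return [
        name
        for zone, names in placement.items()
        if zone != failed_zone
        for name in names
    ]


# Hypothetical names, used only to show the effect of a zone failure.
placement = spread_across_zones(
    ["web-1", "web-2", "web-3", "web-4"], ["zone-a", "zone-b"]
)
print(placement)
print(survivors(placement, "zone-a"))
```

With two zones, losing either one still leaves half of the instances available to carry the application.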
Virtual Private Cloud is a very interesting capability that Amazon
EC2 offers. It's really designed to address the need for secure
and private ways of conducting computing tasks. When we talk
about the cloud, your resources are residing in a third-party
facility, and they are accessible through the public Internet.
What Virtual Private Cloud does is isolate the traffic that you
have and put it over a secure network link using the IPsec
protocol between your data center and your resources in the
public cloud. What it also does is locate your systems, even
though they're running on a shared infrastructure, behind your
firewalls, and make them protected by the rules and processes
that are instituted by your organization. They are running on
the IP address range that belongs to your company or
organization.
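The point about instances running on an address range that belongs to your organization can be illustrated with Python's standard `ipaddress` module. The network range and the sample addresses below are assumptions made up for this sketch.

```python
import ipaddress

# Hypothetical private address range owned by the organization (assumption).
ORG_NETWORK = ipaddress.ip_network("10.20.0.0/16")


def in_org_range(address: str, network=ORG_NETWORK) -> bool:
    """Check whether an instance address falls inside the organization's range."""
    return ipaddress.ip_address(address) in network


# An instance inside the private cloud carries an address from the
# organization's own range; an arbitrary public address does not.
print(in_org_range("10.20.3.15"))
print(in_org_range("192.0.2.10"))
```

A check like this is the kind of rule a corporate firewall applies when it treats cloud instances as part of the internal network.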
When dealing with computing equipment, we do need to understand
the basic architecture of these systems. The architectures that
Amazon EC2 supports are Intel-based architectures for both 32-bit
and 64-bit. We use those resources as virtual machines, which
in the case of Amazon Cloud are called Instances.
What you can do is load one of two types of operating system on
your instances. It can either be Linux, and there are multiple
different distributions of Linux available, or it can be
Windows.
The machines themselves, or Instances, are grouped into families
to designate their size and capacity. There's a Micro family, a
family of Standard Instances, a family of High-Memory Instances,
High-CPU Instances, as well as Cluster Compute Instances. We're
going to take a look at each one just so you get the flavor of
what that really provides you.
When we're looking at the Standard Instances, we're really
looking at types of compute capacity or virtual machines that are
designed for mixed workloads. There are three types of Standard
Instances that are available. There is a small, a large, and an
extra-large, and you'll notice the names in here do not
correspond to the model numbers we're typically used to in the
physical hardware world.
Small is a good example. It's the Instance that gets used by
default. It provides you with 1.7 GB of memory. It provides an
equivalent of about one core. It has some local storage attached
to it, a 160 GB drive. It is a 32-bit platform, and the I/O
performance, which is network I/O performance, is described as
moderate.
If you look at the other two Instances, they are considerably
larger, provide more memory and more compute power, as well as
higher I/O. What is most important for people dealing with data
is that those two are 64-bit. We find that our customers are
relying on 64-bit platforms for their production systems, and
32-bit systems are really only used for experimentation or
learning.
High-memory instances are designed for workloads where, as the
name would suggest, the emphasis is on more memory rather than
compute power to achieve a higher level of performance. There are
three sizes available this time, extra-large, double extra-large,
and quadruple extra-large. You can see that you can get as much
memory as 68.4 GB in a single instance.
It's a fairly sizable machine. All of the instances in here, in
the high-memory category, are 64-bit, as you would expect because
you do need 64-bit addressing to take advantage of all that
memory.
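The reason large memory sizes need 64-bit addressing comes down to simple arithmetic. The sketch below assumes a flat address space (it ignores extensions such as PAE) and treats the quoted memory figures as GiB for simplicity; the function names are invented for this example.

```python
def max_addressable_gib(address_bits: int) -> float:
    """Memory reachable with the given pointer width, expressed in GiB."""
    return 2 ** address_bits / 2 ** 30


def fits_in_32_bit(memory_gb: float) -> bool:
    """True if the memory size can be addressed with 32-bit pointers."""
    return memory_gb <= max_addressable_gib(32)


# 1.7 GB is the Small instance figure and 68.4 GB is the largest
# high-memory figure mentioned in the talk.
print(max_addressable_gib(32))
print(fits_in_32_bit(1.7))
print(fits_in_32_bit(68.4))
```

A 32-bit pointer reaches 4 GiB at most, so the Small instance fits comfortably while the 68.4 GB instance cannot be used without 64-bit addressing.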
High-CPU instances are designed for environments and workloads
where CPU power is much more important than memory, so the
balance is tilted towards CPU power. There are two types of
instances available, medium and extra-large. You can see the
prices in here, and I do want to caution you about using these
prices. Please use them as an illustration only, because the
prices do change, and they are different in different regions.
You'll find the real prices at the URL at the bottom of these
slides.
It is also important to note that the prices in here apply to
something called on-demand pricing, which is the pricing that
you get by just asking for an instance whenever you need it.
There is a different kind of pricing, called reserved instances,
where you pre-pay an amount up front for either a one-year or a
three-year term in exchange for a much lower price on an hourly
basis.
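Whether a reserved instance pays off depends on how many hours the instance actually runs. The following sketch computes the break-even point; the upfront fee and both hourly rates are hypothetical values chosen for illustration, not AWS prices.

```python
def on_demand_cost(hours: float, hourly_rate: float) -> float:
    """Total cost when paying the plain hourly rate."""
    return hours * hourly_rate


def reserved_cost(hours: float, upfront_fee: float, reserved_rate: float) -> float:
    """Total cost when pre-paying a fee in exchange for a lower hourly rate."""
    return upfront_fee + hours * reserved_rate


def break_even_hours(upfront_fee: float, on_demand_rate: float,
                     reserved_rate: float) -> float:
    """Hours of use after which the reserved instance becomes cheaper."""
    if on_demand_rate <= reserved_rate:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / (on_demand_rate - reserved_rate)


# All three numbers are assumptions, not quoted prices.
UPFRONT_FEE = 200.0
ON_DEMAND_RATE = 0.10
RESERVED_RATE = 0.05

hours = break_even_hours(UPFRONT_FEE, ON_DEMAND_RATE, RESERVED_RATE)
print(f"break-even after about {hours:.0f} hours")
print(f"one year on-demand: ${on_demand_cost(8760, ON_DEMAND_RATE):.2f}")
print(f"one year reserved:  ${reserved_cost(8760, UPFRONT_FEE, RESERVED_RATE):.2f}")
```

Under these assumed numbers, an instance that runs around the clock for a year is well past the break-even point, while one that runs only occasionally is better off on the hourly rate.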
There is also another type of pricing called spot pricing, very
similar to what we're used to when dealing with commodities. The
prices in there fluctuate, and you can start your instances to
take advantage of a lower price in the market.
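One way to picture the spot market is as a series of hourly prices compared against the most you are willing to pay. The sketch below is a simplified model under that assumption, with made-up prices; it does not reproduce how AWS actually allocates spot capacity.

```python
def hours_worth_running(spot_prices, max_price):
    """Return the indices of the hours whose spot price is within budget."""
    return [hour for hour, price in enumerate(spot_prices) if price <= max_price]


def spot_bill(spot_prices, max_price):
    """Sum the spot price over the hours in which the instance would run."""
    return sum(price for price in spot_prices if price <= max_price)


# Hypothetical hourly spot prices in dollars (assumption).
prices = [0.04, 0.03, 0.09, 0.12, 0.05, 0.02]
MAX_PRICE = 0.05  # the most we are willing to pay per hour (assumption)

print(hours_worth_running(prices, MAX_PRICE))
print(f"${spot_bill(prices, MAX_PRICE):.2f}")
```

Workloads that can be paused and resumed, such as batch jobs, are the natural fit for this kind of pricing.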
Cluster Compute instances are a very special type of instance,
and the reason they're special is the virtualization
infrastructure is much more direct to the hardware, so it's got a
smaller impact on the performance, but more important, the
instances that are cluster compute instances can communicate with
each other over very high-bandwidth 10 Gigabit Ethernet networking.
That's a great capability that you may employ when dealing with
very large data transfers in between your instances.
For example, if you were to run Hadoop MapReduce jobs, or
databases where you need to put a cluster of databases out there
with high-bandwidth connectivity between nodes, these types of
instances are very beneficial.
There is another type of instance called Micro Instance. It is
available in both 32- and 64-bit platforms. However, you will
notice that the amount of memory that you get in that instance is
very low. On the compute capacity side, it actually gives you a
reasonable amount of compute power in bursts. However, because
it has so little memory, it really is not designed for
production, database, or big data workloads.
However, it's a fantastic way to learn the environment because,
if you are a new customer to Amazon Web Services, you can get
this for free for a year. That includes the instances
themselves as well as some storage and some bandwidth for data
transfer.
When it comes to storage, there are really three types of storage
that are available on the Amazon Cloud. There is something
called Instance Storage, Elastic Block Store or EBS, and S3.
We're going to talk about each one in sequence here.
Instance storage is something that you get for free. It is
attached to your instance, as the name would suggest, but one of
the key aspects of instance storage is that it is non-persistent.
What that means is that when you shut down your virtual machine
or instance, or if it were to fail, the instance storage will be
gone. So will the data that is residing on that storage. This
data is non-recoverable. It's a very important point to
understand. Instance storage is included in the price of your
instance, but it is non-persistent.
Elastic Block Store or EBS is the persistent store. It behaves
and looks identical to any disk that you would typically mount
and attach to your operating system. These disk
drives can be allocated in sizes between 1 GB and 1 TB. You can
mount multiple volumes of EBS storage and attach them to your
machine. You can RAID these volumes just like you would normal
hard disks. However, there is a charge for this storage. The
charges are for reserved capacity: you'll be charged on a
monthly basis for the full size of the drive that you've
allocated, and the price is $0.10/GB/month.
There is an additional charge for I/O operations of $0.10 per
million I/O requests. In my experience, these charges don't
really amount to any significant dollar value, but it is a
charge that you should be aware of.
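The two EBS charges described above combine into a simple monthly estimate. The rates are the illustrative figures quoted in the talk and should not be read as current prices; the volume size and request count in the example are hypothetical.

```python
# Illustrative rates quoted in the talk; real prices change and vary by region.
EBS_PRICE_PER_GB_MONTH = 0.10
EBS_PRICE_PER_MILLION_IO = 0.10


def ebs_monthly_cost(allocated_gb: float, io_requests: int) -> float:
    """Estimate a month of EBS charges: allocated capacity plus I/O requests."""
    storage_charge = allocated_gb * EBS_PRICE_PER_GB_MONTH
    io_charge = (io_requests / 1_000_000) * EBS_PRICE_PER_MILLION_IO
    return storage_charge + io_charge


# Hypothetical workload: a 500 GB volume serving 50 million I/O requests.
# Note that the storage charge applies to the allocated size, used or not.
print(f"${ebs_monthly_cost(500, 50_000_000):.2f}")
```

In this example the capacity charge is ten times the I/O charge, which matches the remark that the I/O charges rarely amount to much.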
Last, but not least, there is another type of storage called S3.
The best way to think of this is as file- and folder-based
storage, accessible through REST APIs, so it's not accessible
through the normal disk read-write APIs. You cannot mount it as
a disk. It is a great way to store large amounts of data, for
example, data you're going to be bringing into your cluster or
into your database. It's also a great way to store backups for a
database and so on.
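Because S3 is reached over HTTP rather than mounted as a disk, every object is identified by a URL instead of a file path. The sketch below only builds such a URL in the virtual-hosted style; the bucket name and key are hypothetical, and a real request to a private object would also need to be authenticated, which is not shown here.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, key: str) -> str:
    """Build a virtual-hosted-style URL for an object stored in S3."""
    # quote() percent-encodes characters such as spaces but keeps "/" intact,
    # so the folder-like structure of the key stays readable in the URL.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# Hypothetical bucket and key for a database backup.
print(s3_object_url("example-bucket", "backups/2014-06-05/db dump.sql"))
```

An HTTP GET or PUT against a URL of this shape takes the place of the read and write calls you would use with a mounted disk.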
The charges in here are based on the amount of storage actually
used, so there's no reservation for this storage. They start at
$0.14/GB/month. There are also charges for data transfers, and
when we're dealing with data, we're always concerned about what will
it cost for us to move the data into the cloud or get it out of
the cloud when need be.
There are multiple ways of moving data in and out of the cloud.
We're going to first talk about the way of transferring it over
the network or the Internet. When transferring over the network,
transfers of the data into the cloud are free. However,
transfers out of the cloud do incur a charge. The first
gigabyte is free, and then you're charged on a sliding scale,
meaning the more you transfer, the less you pay per gigabyte.
Those charges start at about $0.12/GB.
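A sliding scale like this is usually computed tier by tier. In the sketch below, the free first gigabyte and the $0.12/GB starting rate come from the talk; the tier sizes and the lower rates are hypothetical, since the talk does not state them.

```python
# Each entry is (tier size in GB, price per GB in dollars).
TRANSFER_OUT_TIERS = [
    (1, 0.00),             # first gigabyte is free (from the talk)
    (10_000, 0.12),        # starting rate from the talk; tier size is assumed
    (40_000, 0.09),        # hypothetical tier
    (float("inf"), 0.07),  # hypothetical tier
]


def transfer_out_cost(gigabytes: float) -> float:
    """Charge for data transferred out, filling one pricing tier at a time."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    remaining = gigabytes
    total = 0.0
    for tier_size, price_per_gb in TRANSFER_OUT_TIERS:
        used = min(remaining, tier_size)
        total += used * price_per_gb
        remaining -= used
        if remaining <= 0:
            break
    return total


print(f"${transfer_out_cost(1):.2f}")
print(f"${transfer_out_cost(101):.2f}")
```

Each additional gigabyte is billed at the rate of the tier it falls into, so the average price per gigabyte drops as the volume grows.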
When you need to transfer very large volumes of data, I want you
to take a look at the third option, which is import/export. It
is an option that is based on your ability to send a portable
storage device to the proper AWS data center and have them import
the data in for you and then return the device back to you.
There is a charge for this service. It's $80 for handling the
device, as well as $2.49 per data-loading hour. As I mentioned
before, the device is returned to you. If you reside in the
United States and regular shipping is acceptable, there is no
charge for that. Outside of the United States or for expedited
shipping, there is an additional shipping charge.
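Deciding between shipping a device and uploading over the network is mostly a question of time against a fixed fee. The sketch below uses the two import/export charges quoted in the talk; the data volume, the loading time, and the link speed are hypothetical inputs.

```python
# Illustrative charges quoted in the talk; real prices change.
HANDLING_FEE_PER_DEVICE = 80.00
PRICE_PER_LOADING_HOUR = 2.49


def import_export_cost(loading_hours: float, devices: int = 1) -> float:
    """Handling fee per device plus the hourly data-loading charge."""
    return devices * HANDLING_FEE_PER_DEVICE + loading_hours * PRICE_PER_LOADING_HOUR


def network_upload_hours(data_gb: float, link_mbps: float) -> float:
    """Hours needed to push data_gb over a link sustaining link_mbps."""
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / link_mbps / 3600


# Hypothetical case: 1000 GB of data, 10 loading hours, a 100 Mbit/s uplink.
print(f"import/export: ${import_export_cost(10):.2f}")
print(f"network upload: {network_upload_hours(1000, 100):.1f} hours")
```

Uploading over the network is free, so the device only makes sense when the upload time on your link is longer than you can afford to wait.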
In addition to transferring the data in and out of the Amazon
Cloud, if you are dealing with clusters of computers, such as
those used in Hadoop or in cluster databases, you will on
occasion need to transfer data between the machines. If the
machines in your cluster reside in the same availability zone,
there is no charge for data transfers. If you elect to locate
your machines in multiple availability zones to deliver
resiliency, then you will be charged for the data transfer
between the machines, and those
data transfers are (inaudible). This concludes this lesson and
presentation. I thank you for your attention and wish you the
best of luck with the rest of the course.