Implementing NetCat Source using Flume

In the following tutorial, we will begin with a very simple Flume agent that comprises:

1. Source: NetCat Source (org.apache.flume.source.NetcatSource)
2. Channel: Memory Channel (org.apache.flume.channel.MemoryChannel)
3. Sink: Logger Sink (org.apache.flume.sink.LoggerSink), useful for testing/debugging purposes

Here, the netcat command acts as the actual source of data: it feeds data to the NetCat Source within the agent, the events are stored in the intermediate Memory Channel, and every event is then logged by the Logger Sink.

P.S.: The 'netcat' command opens a connection between two machines and listens to the stream. In our case both ends are on localhost, and each line of text sent over the stream becomes one event.

Let's start with the following steps:

1. Create a conf file 'myflume.conf' under the conf directory:
agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345

agent.channels.c1.type=memory
agent.sinks.k1.type=logger

agent.sinks.k1.channel=c1

2. Run the flume-ng command:

ubuntu-vb@ubuntu-vb:~/hadoop_repo/flume/flume152$ flume-ng agent -n agent -c conf -f conf/myflume.conf -Dflume.root.logger=INFO,console

+ exec /usr/lib/jvm/java-7-oracle/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp ‘/home/ubuntu-vb/hadoop_repo/flume/flume152/conf:/home/ubuntu-vb/hadoop_repo/flume/flume152/lib/*’ -Djava.library.path= org.apache.flume.node.Application -n agent -f conf/hw.conf
2015-05-10 15:15:01,500 (lifecycleSupervisor-1-0) [INFO – org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2015-05-10 15:15:01,514 (conf-file-poller-0) [INFO – org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file:conf/hw.conf
2015-05-10 15:15:01,573 (conf-file-poller-0) [INFO – org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:k1
2015-05-10 15:15:01,590 (conf-file-poller-0) [INFO – org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:930)] Added sinks: k1 Agent: agent
2015-05-10 15:15:01,598 (conf-file-poller-0) [INFO – org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:k1
2015-05-10 15:15:01,674 (conf-file-poller-0) [INFO – org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [agent]
2015-05-10 15:15:01,675 (conf-file-poller-0) [INFO – org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:150)] Creating channels
2015-05-10 15:15:01,699 (conf-file-poller-0) [INFO – org.apache.flume.channel.DefaultChannelFactory.create(DefaultChannelFactory.java:40)] Creating instance of channel c1 type memory
2015-05-10 15:15:01,726 (conf-file-poller-0) [INFO – org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:205)] Created channel c1
2015-05-10 15:15:01,749 (conf-file-poller-0) [INFO – org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:39)] Creating instance of source s1, type netcat
2015-05-10 15:15:01,814 (conf-file-poller-0) [INFO – org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:40)] Creating instance of sink: k1, type: logger
2015-05-10 15:15:01,822 (conf-file-poller-0) [INFO – org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:119)] Channel c1 connected to [s1, k1]
2015-05-10 15:15:01,939 (conf-file-poller-0) [INFO – org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration:{ sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@1b620a2 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
2015-05-10 15:15:02,029 (conf-file-poller-0) [INFO – org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel c1
2015-05-10 15:15:02,424 (lifecycleSupervisor-1-0) [INFO – org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2015-05-10 15:15:02,430 (lifecycleSupervisor-1-0) [INFO – org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: CHANNEL, name: c1 started
2015-05-10 15:15:02,441 (conf-file-poller-0) [INFO – org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2015-05-10 15:15:02,490 (conf-file-poller-0) [INFO – org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source s1
2015-05-10 15:15:02,502 (lifecycleSupervisor-1-2) [INFO – org.apache.flume.source.NetcatSource.start(NetcatSource.java:150)] Source starting
2015-05-10 15:15:02,660 (lifecycleSupervisor-1-2) [INFO – org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0:0:0:0:0:0:0:0:12345]
3. Open another terminal and type:
ubuntu-vb@ubuntu-vb:~/hadoop_repo/flume/flume152$ nc localhost 12345
Hello Flume
OK

4. Now look at the earlier terminal:
2015-05-10 15:18:36,762 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO – org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 46 6C 75 6D 65 Hello Flume }

This shows that the agent's source accepted the text string as an event when it was sent from the 'nc' command: the event passed through the memory channel and was logged by the sink (a log4j logger).

Apache Flume

Flume is a distributed, scalable, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

In short, Flume is a distributed, reliable, scalable, extensible service to move data from a source to a destination.

Flume helps aggregate the logs generated by different sources (or application services) on different machines (or nodes of a cluster) so that they can be analyzed and processed in a Hadoop environment.

Reliable: Fault Tolerance and High Availability [Tunable data reliability levels]
Scalable: Horizontal Scalability of nodes [Can add more collectors to increase availability]
Extensible data model: Can deal with all kinds of data/sources [Twitter, Syslog etc]

Flume-NG (Flume1.x) is a major overhaul of Flume-OG (Flume 0.x).

Flume Data flow model Terminology
A Flume data flow is a complete transport from source to sink. A flow carries a particular type of data, such as server logs or click streams. The following components play a vital role in the flow:

a. Agent b. Event c. Source d. Channel e. Sink f. Interceptors (Optional)

(The above image is from the Apache Flume website; copyright belongs to them.)

Agent
– A Flume agent is a JVM process that hosts the components (Source, Channel, Sink) through which events flow from an external source to an external destination.

Event
– An Event is a unit of data that flows through one or more Flume agents. An event flows from source to channel to sink.
– An Event comprises a byte-array payload and an optional set of headers.
– A header is a key-value pair.
– Flume uses a transactional approach to guarantee the reliable delivery of the events.

Source
– A Flume source consumes events from an external source such as a web server or a third-party API like Twitter.
– The external source sends events to Flume in a format that is recognized by the target Flume source.
– A source writes events to one or more channels.

Channel
– A channel is a passive store: it receives events from a source and keeps them until they are consumed by a Flume sink.
– The channel acts as the glue between source and sink.
– A channel can be in-memory (fast, but not reliable or recoverable) or disk-based (slower, but reliable and recoverable).
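For illustration, the two channel types differ only in configuration. The property names below are standard Flume channel properties; the capacity values and directories are made-up examples:

```
# In-memory channel: fast, but events are lost if the agent dies
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 100

# File channel: slower, but events survive an agent restart
agent.channels.c2.type = file
agent.channels.c2.checkpointDir = /var/flume/checkpoint
agent.channels.c2.dataDirs = /var/flume/data
```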

Sink
– The sink extracts events from the channel and puts them into an external repository such as HDFS.
– A sink could write to an HDFS-formatted file system, forward to another agent (the next hop), display to text/console, or discard events (a null sink).

Flume Interceptors
– Interceptors are part of Flume's extensibility model.
– They allow events to be inspected, modified, or dropped as they pass between a source and a channel, and they can be chained together to form a processing pipeline.
– They are similar to the Servlet Filter mechanism.
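As a sketch, chaining two of Flume's built-in interceptors on the source from the earlier myflume.conf looks like this (the timestamp and host interceptor types are standard; the names i1/i2 are arbitrary):

```
# Chain two built-in interceptors on source s1
agent.sources.s1.interceptors = i1 i2
# i1 adds a 'timestamp' header with the event's arrival time
agent.sources.s1.interceptors.i1.type = timestamp
# i2 adds a 'host' header with the agent's hostname/IP
agent.sources.s1.interceptors.i2.type = host
```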

Multi Hop Flows
– Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.

So, in the case of a complex flow, the sink forwards events to the Flume source of the next agent (the next hop) in the flow.

Fan out flow
– Fanning out means delivering an event from one source to multiple channels during the flow
– Replicating and multiplexing are the two types of fan-out
– Replicating: the event is written to all configured channels. This is the default fan-out mechanism
– Multiplexing: the event is written only to specific configured channels (e.g., on the basis of a certain header value)
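A multiplexing selector can be sketched with the standard Flume selector properties. The header name 'state' and the channel mappings below are made-up examples:

```
agent.sources.s1.channels = c1 c2
agent.sources.s1.selector.type = multiplexing
# route on the value of the 'state' header
agent.sources.s1.selector.header = state
agent.sources.s1.selector.mapping.CA = c1
agent.sources.s1.selector.mapping.NY = c2
# events with any other value go to the default channel
agent.sources.s1.selector.default = c1
```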

The External source and Target Flume Source event Format
– Generally the external source sends events in a format that is recognizable to the target Flume source.
– E.g., an Avro/Thrift Flume source can be used to receive Avro/Thrift events from Avro/Thrift clients.

Flume provides the following out-of-the-box implementations for Sources, Channels, and Sinks:
Source
1 Avro Source (type of log stream)
2 Thrift Source (type of log stream)
3 Exec Source
4 JMS Source: Converter
5 Spooling Directory Source
6 Twitter firehose Source (experimental)
7 Event Deserializers: LINE, AVRO, BlobDeserializer
8 NetCat Source (type of log stream)
9 Sequence Generator Source
10 Syslog Sources (type of log stream): Syslog TCP Source, Multiport Syslog TCP Source, Syslog UDP Source
11 HTTP Source: JSONHandler, BlobHandler
12 Legacy Sources: Avro Legacy Source, Thrift Legacy Source
13 Custom Source
14 Scribe Source

Channel
1. Memory Channel
2. JDBC Channel
3. File Channel
4. Spillable Memory Channel
5. Pseudo Transaction Channel
6. Custom Channel

Sinks
1. HDFS Sink
2. Logger Sink
3. Avro Sink
4. Thrift Sink
5. IRC Sink
6. File Roll Sink
7. Null Sink
8. HBaseSinks: HBaseSink, AsyncHBaseSink
9. MorphlineSolrSink
10. ElasticSearchSink
11. Kite Dataset Sink (experimental)
12. Custom Sink

Hadoop YARN Overview

In earlier versions of Hadoop, the JobTracker works as a master process and coordinates all activity with the TaskTrackers. Each node has a TaskTracker process that manages tasks on that individual node. The TaskTrackers communicate with and are controlled by the JobTracker. The JobTracker is responsible for resource management (managing the job life-cycle, tracking resource availability, coordinating with TaskTrackers, etc.).

All in all, in this Hadoop architecture:
1. MapReduce is responsible for both cluster resource management and data processing, which tightly couples the two concerns.
2. It is limited to batch-oriented processing and unable to support interactive, real-time, graph, machine-learning, or other memory-intensive workloads.

YARN solves these two key issues: it shifts Hadoop from a single-use to a multi-purpose system, promotes a loosely coupled architecture, and divides the two prime responsibilities, resource management and job scheduling/monitoring, into separate daemons.

In YARN, the JobTracker and TaskTracker no longer exist; they are replaced by the following components:
1.  ResourceManager
2.  ApplicationMaster (AM), one per application
3.  NodeManager (NM), one per slave node

In addition to the above, the "container" terminology is also introduced.
Apart from YARN, HDFS is also revamped (HDFS2) and provides the following features:
1.  NameNode HA
2.  Snapshots
3.  Federation

Hadoop1 Vs Hadoop2

(Image: Hadoop 1 vs Hadoop 2 architecture comparison)

Let’s quickly understand more on the Hadoop2 components:

1. ResourceManager
–  is primarily a pure scheduler.
–  It is strictly limited to arbitrating requests for available resources in the system made by the competing applications.
–  It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).

2. NodeManager (per-node slave)
– runs on each node in the cluster and takes direction from the ResourceManager.
– It is responsible for managing resources available on a single node.
– The NM also oversees container life-cycle management and monitors the resource usage (memory, CPU) of individual containers.

Before getting into ApplicationMaster, understanding “container” is very important.

A container is a collection of physical resources such as RAM, CPU cores, and disks on a single node. A single node can host multiple containers, each with a minimum size of memory (e.g., 512 MB or 1 GB) and CPU. The ApplicationMaster can request containers whose size is any multiple of that minimum. A container is supervised by the NodeManager and scheduled by the ResourceManager, and containers may be requested and released dynamically at run time.

For instance, containers are requested and released dynamically during the map and reduce phases.
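The minimum and maximum container sizes mentioned above are controlled in yarn-site.xml. The property names are standard YARN settings; the values shown are just examples:

```xml
<!-- yarn-site.xml: bounds for container resource requests -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>  <!-- smallest container the RM will grant -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>  <!-- largest single-container request -->
</property>
```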

3. ApplicationMaster
– is an instance of a processing-framework-specific library (e.g., Storm, Spark, etc.).
– The AM negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the containers and their resource consumption. Generally, the containers are logical holders for the processes that actually perform the work.
– The actual data processing occurs within the containers executed on behalf of the ApplicationMaster. A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
– The ApplicationMaster must take the container and present it to the NodeManager managing the host on which the container was allocated, in order to use those resources for launching its tasks.
– Because each AM, rather than the RM, manages its own application's life cycle and fault tolerance, this responsibility is spread across the cluster, and YARN scales to a much greater extent.

Apache Sqoop

Sqoop ("SQL-to-Hadoop") is a tool designed to transfer data between Hadoop and relational databases. Sqoop lets you import data from a relational database management system (RDBMS) such as MySQL into the Hadoop Distributed File System (HDFS), process it with MapReduce programs, and later export the processed data back into an RDBMS.

Sqoop is helpful when you analyze certain behaviour (e.g., by reading server logs) and wish to view the results of that analysis often. Triggering an MR program each time would not be a feasible way to get a quick view of the data (plus Hadoop systems are not good at quick reads of small chunks).

To overcome this, we can capture some of the data, import it into HDFS, process it, and export it back to Hive or another data system for ad-hoc queries.

In this short tutorial, we will see how Sqoop can be used to import data from a relational table into HDFS and vice versa.

Environment
Hadoop: hadoop-2.4.0.tar (Assuming the Hadoop is already installed)
Sqoop: sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar
(http://www.apache.org/dist/sqoop/1.4.5/)
MySQL JAR: mysql-connector-java-5.1.34.jar

Importing data from MySQL table to HDFS
Step 1: Creating database and table in mysql

mysql> create database sqoopdb;
mysql> use sqoopdb;
mysql> create table employee (name varchar(255), salary double(7,2));
mysql> insert into employee values ('John', 54887.00);
mysql> insert into employee values ('Tim', 98544);

mysql> select * from employee;
+------+----------+
| name | salary   |
+------+----------+
| John | 54887.00 |
| Tim  | 98544.00 |
+------+----------+
2 rows in set (0.00 sec)

Step 2: Copying 'mysql-connector-java-5.1.34.jar' to the $SQOOP_HOME/lib directory

Note: Initially I had copied 'mysql-connector-java-5.0.5.jar', but with that version I was getting the following exception (trimmed for brevity):

“INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@4d3c7378 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries. java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@4d3c7378 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.”

“ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter”

Step 3: Importing the table into HDFS
$ sqoop import --connect jdbc:mysql://localhost:3306/sqoopdb --username root --password root --table employee -m 1

INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
INFO tool.CodeGenTool: Beginning code generation
INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop240
Note: /tmp/sqoop-hduser/compile/8b0c322a9f8c2420e9bbd2dd079dea4d/employee.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/8b0c322a9f8c2420e9bbd2dd079dea4d/employee.jar
WARN manager.MySQLManager: It looks like you are importing from mysql.
WARN manager.MySQLManager: This transfer can be faster! Use the --direct
WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
INFO mapreduce.ImportJobBase: Beginning import of employee
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
INFO db.DBInputFormat: Using read commited transaction isolation
INFO mapreduce.JobSubmitter: number of splits:1

INFO mapreduce.Job: Job job_1427265867092_0001 running in uber mode : false
INFO mapreduce.Job:  map 0% reduce 0%
INFO mapreduce.Job:  map 100% reduce 0%
INFO mapreduce.Job: Job job_1427265867092_0001 completed successfully
INFO mapreduce.Job: Counters: 30

INFO mapreduce.ImportJobBase: Transferred 25 bytes in 47.9508 seconds (0.5214 bytes/sec)
INFO mapreduce.ImportJobBase: Retrieved 2 records.

Step 4: Listing the data file content
$ hadoop dfs -ls -R employee
$ hadoop dfs -cat /user/hduser/employee/part-m-00000
John,54887.0
Tim,98544.0

Exporting data from HDFS to a MySQL table
Step 1: Creating an empty table in MySQL
mysql> use sqoopdb;
mysql> create table employee_export (name varchar(255), salary double(7,2));
Query OK, 0 rows affected (0.06 sec)

mysql> desc employee_export;
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| name   | varchar(255) | YES  |     | NULL    |       |
| salary | double(7,2)  | YES  |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> select * from employee_export;
Empty set (0.00 sec)

Step 2: Creating a directory and copying the CSV to HDFS
$ cat employee.csv
Jack,9878.21
Mark,8754.65

hadoop fs -mkdir -p /user/hduser/export
hadoop fs -copyFromLocal employee.csv /user/hduser/export/employee.csv

Step 3: Running the export command
$ sqoop export --connect jdbc:mysql://localhost/sqoopdb --username root --password root --table employee_export --export-dir '/user/hduser/export' -m 1

INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/03/25 18:55:19 INFO tool.CodeGenTool: Beginning code generation
15/03/25 18:55:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_export` AS t LIMIT 1
15/03/25 18:55:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee_export` AS t LIMIT 1
15/03/25 18:55:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop240
Note: /tmp/sqoop-hduser/compile/6e14d90fb8e22995a61e9be9af177f1a/employee_export.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/03/25 18:55:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/6e14d90fb8e22995a61e9be9af177f1a/employee_export.jar
15/03/25 18:55:21 INFO mapreduce.ExportJobBase: Beginning export of employee_export

15/03/25 18:55:23 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/03/25 18:55:25 INFO input.FileInputFormat: Total input paths to process : 1
15/03/25 18:55:25 INFO input.FileInputFormat: Total input paths to process : 1
15/03/25 18:55:25 INFO mapreduce.JobSubmitter: number of splits:1
15/03/25 18:55:25 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/03/25 18:55:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1427265867092_0006
15/03/25 18:55:26 INFO impl.YarnClientImpl: Submitted application application_1427265867092_0006

15/03/25 18:55:34 INFO mapreduce.Job:  map 0% reduce 0%
15/03/25 18:55:41 INFO mapreduce.Job:  map 100% reduce 0%
15/03/25 18:55:41 INFO mapreduce.Job: Job job_1427265867092_0006 completed successfully
15/03/25 18:55:42 INFO mapreduce.Job: Counters: 30
15/03/25 18:55:42 INFO mapreduce.ExportJobBase: Transferred 163 bytes in 18.4814 seconds (8.8197 bytes/sec)
15/03/25 18:55:42 INFO mapreduce.ExportJobBase: Exported 2 records.
———————-
mysql> select * from sqoopdb.employee_export;
+------+---------+
| name | salary  |
+------+---------+
| Jack | 9878.21 |
| Mark | 8754.65 |
+------+---------+
2 rows in set (0.00 sec)

Apache Storm

What is Storm
Apache Storm is a distributed realtime computation system that processes unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. Realtime analytics, online machine learning, continuous computation, distributed RPC, and ETL are a few of the primary use cases that can be addressed via Storm.

Storm's logical overview
At the highest level, Storm is comprised of topologies. A topology is a graph of computations: each node contains processing logic, and each path between nodes indicates how data is passed between them. A topology contains a network of streams, where a stream is an unbounded sequence of tuples. In short:

  1. Tuple: Ordered list of elements. E.g.: (“orange”, ”tweet-123″, …).
    Valid types: String, Integer, byte-array; you can also define your own serializers so that custom types can be used natively within tuples.
  2. Stream: Unbounded sequence of tuples: Tuple1, Tuple2, Tuple3, …

Storm uses spouts, which take a continuous input stream from a source (e.g., Twitter) and emit the data as a stream of tuples to components called bolts for consumption. An emitted tuple can go from a spout to a bolt and/or from one bolt to another bolt. A Storm topology may have one or more spouts and bolts, and as an implementer/programmer you can configure multiple spouts/bolts according to the business logic.

(The above image is from Apache Storm’s website)

Storm's architectural overview
Storm runs in a clustered environment. Similar to Hadoop, it has two types of nodes:

  1. Master node: This node runs a daemon process called 'Nimbus'. Nimbus is responsible for distributing code, i.e. the topology (spouts + bolts), across the cluster, assigning tasks to worker nodes, and monitoring the success and failure of units of work.
  2. Worker node: Each worker node runs a daemon process called the 'Supervisor'. The Supervisor is responsible for starting and stopping worker processes. Each worker process executes a subset of a topology, so the execution of a topology is spread across different worker processes running on the cluster.

Storm leverages ZooKeeper to maintain the communication between Nimbus and the Supervisors: Nimbus communicates with the Supervisors by passing messages through ZooKeeper. ZooKeeper maintains the complete state of the topology, which allows Nimbus and the Supervisors to be fail-fast and stateless.

Storm mode
1. Local mode: In local mode, Storm executes topologies completely in-process by simulating worker nodes using threads.
2. Distributed mode: Runs across the cluster of machines.

Hadoop for beginners

In this article, we will see a top level insight on Hadoop and its ecosystem.

Apache Hadoop is an open source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. The following modules comprise the Hadoop framework:

1. Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): a distributed file-system, which provides very high aggregate bandwidth across the cluster.
3. Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
4. Hadoop MapReduce: a programming model for large scale data processing.

HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system for the Hadoop framework. Running on large clusters of commodity machines, it provides high scalability, streaming access, high throughput, and reliability, and it can store massive amounts of data under a single-writer/multiple-reader model.

A few HDFS concepts:

  1. Block: A disk has a block size, which is the minimum amount of data that it can read or write. The HDFS block size is 64 MB by default. Each block is replicated to a small number of physically separate machines (typically three).
  2. NameNode and DataNode: An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (master) and a number of datanodes (workers).
    1. NameNode: The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located. One important aspect of the NameNode is that it is a single point of failure.
    2. DataNode: A DataNode is a commodity machine (less expensive) used to store a large amount of data. It executes all commands driven by the NameNode, such as the physical creation, deletion, and replication of blocks, and also performs the low-level I/O operations that serve HDFS client requests. By nature the DataNode is a slave: every three seconds it sends the NameNode a heartbeat, reporting the health of the HDFS cluster, along with a block report. These block reports contain information regarding which block belongs to which file. A DataNode also enables pipelining of data and can forward data to another DataNode in the same cluster.
    3. Secondary NameNode: This node is also known as the CheckpointNode or helper node. It is a separate, highly reliable machine with lots of CPU power and RAM. It periodically merges the NameNode's namespace image with the edit log to produce a checkpoint of the filesystem metadata; it is not a hot standby for the NameNode.

Map Reduce without jargons

Map Reduce
In simple terms, a list of <key, value> pairs is mapped into another list of <key, value> pairs, which is then grouped by key and reduced into a list of values.

E.g:

[A] dataset1.txt: Hadoop was created by Doug Cutting and Mike Cafarella
[B] dataset2.txt: Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant

These two data-sets [A & B] given as input will get divided into splits. Each split yields key-value pairs: the key is the offset/line number and the value is the content of the respective split at that offset/line number. The map function discards the line number and produces a per-line (word, 1) pair for each word in the input line.

So the map phase takes (line number, text) pairs as input and emits (word, 1) pairs as output:

[(“Hadoop”, 1),(“was”, 1),(“created”, 1),(“by”, 1),(“Doug”, 1),(“Cutting”, 1),(“and”, 1),(“Mike”, 1),(“Cafarella”, 1),(“Cutting”, 1),(“who”, 1),(“was”, 1),(“working”, 1),(“at”, 1),(“Yahoo”, 1),(“at”, 1),(“the”, 1),(“time”, 1),(“named”, 1),(“it”, 1),(“after”, 1),(“his”, 1),(“son’s”, 1),(“toy”, 1),(“elephant”, 1)]

The output of the mapper contains multiple key-value pairs with the same key. So, before entering the reduce phase, the map-reduce framework consolidates all the values for each key. The input to the reducer is thus (key, list-of-values) pairs. Below is the output from the shuffle phase:

{“Hadoop”: [1], “was”: [1,1], “created”: [1], “by”: [1], “Doug”: [1], “Cutting”: [1,1], “and”: [1], “Mike”: [1], “Cafarella”: [1],
“who”: [1], “working”: [1], “at”: [1,1], “Yahoo”: [1], “the”: [1], “time”: [1], “named”: [1], “it”: [1], “after”: [1], “his”: [1], “son’s”: [1], “toy”: [1], “elephant”: [1]}

After the shuffle, the reducer takes the consolidated key-value pairs as input, simply sums up each list of intermediate values, and produces the key and the sum as output:

[(“Hadoop”, 1),(“was”, 2),(“created”, 1),(“by”, 1),(“Doug”, 1),(“Cutting”, 2),(“and”, 1),(“Mike”, 1),(“Cafarella”, 1),(“who”, 1),(“working”, 1),(“at”, 2),(“Yahoo”, 1),(“the”, 1),(“time”, 1),(“named”, 1),(“it”, 1),(“after”, 1),(“his”, 1),(“son’s”, 1),(“toy”, 1),(“elephant”, 1)]
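The three phases can be sketched in a few lines of Python. This is a simplified, single-process illustration of the map/shuffle/reduce steps above, not the actual Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: discard the line number and emit a (word, 1) pair per word,
    # stripping trailing punctuation the way the hand-worked example does
    pairs = []
    for _, text in lines:
        for word in text.split():
            pairs.append((word.strip(",!"), 1))
    return pairs

def shuffle_phase(pairs):
    # Shuffle: group all values under their key
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum each key's list of values
    return {word: sum(ones) for word, ones in grouped.items()}

lines = [
    (1, "Hadoop was created by Doug Cutting and Mike Cafarella"),
    (2, "Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant"),
]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["was"], counts["at"], counts["Cutting"])  # each appears twice
```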

Apache Pig Quick Tutorial


The following tutorial is about Apache Pig. This is a beginner's tutorial written as part of self-learning.

# Prerequisites on Ubuntu:
1. hadoop-1.0.3
2. pig-0.12.0
3. DataSet: Sample youtube dataset (http://netsg.cs.sfu.ca/youtubedata/)

From https://pig.apache.org/:
“Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.”

1. Modes in Apache Pig:
a. Local Mode
– Requires a single machine.
– To start in local mode you need to specify the -x flag (pig -x local).
– Suitable for learning and testing with small datasets.

b. MapReduce Mode
– Default mode (no need to specify pig -x mapreduce).
– Requires access to a Hadoop cluster and an HDFS installation.

# Start the Pig in local mode:
hduser@localhost:~$ pig -x local

#Loading dataset into Pig:
grunt> youTube = LOAD '/home/hduser/pig_data/utube_mod1.csv' USING PigStorage(',') AS (video_id,uploader,age,category,length,views,rate,ratings,comments);

Here, 'youTube' is a relation. 'PigStorage' is one of Pig's load/store functions: it parses input records based on a delimiter, and the resulting fields can then be referenced positionally or by alias.
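Conceptually, what PigStorage(',') does to each record can be sketched in a few lines of Python (the record below is one real row from the dataset; the variable names are illustrative):

```python
# One CSV record shaped like a row of the YouTube dataset.
record = "D6frFp-VwHs,yetube,821,Entertainment,30,554455,3.54,2813,422"

# Field aliases, in the same order as the AS clause in the LOAD statement.
aliases = ["video_id", "uploader", "age", "category", "length",
           "views", "rate", "ratings", "comments"]

# Split on the delimiter, as PigStorage(',') would.
fields = record.split(",")

# Positional reference ($6 in Pig Latin) vs. reference by alias.
print(fields[6])            # the 'rate' field, still a string
row = dict(zip(aliases, fields))
print(row["rate"])          # same field, accessed by alias
```

Note that every field stays a string here, which mirrors why Pig reports them all as bytearray until you cast or declare types.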

#describe: the describe operator returns the schema of the relation. All fields default to bytearray here because no types were declared in the AS clause.
grunt> describe youTube;
youTube: {video_id: bytearray,uploader: bytearray,age: bytearray,category: bytearray,length: bytearray,views: bytearray,rate: bytearray,ratings: bytearray,comments: bytearray}

#dump: the dump operator runs (executes) Pig Latin statements and displays the results on screen. It is generally used for debugging.
grunt> DUMP youTube;

2014-02-19 15:18:45,550 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-02-19 15:18:45,718 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-02-19 15:18:46,091 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-02-19 15:18:46,241 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-02-19 15:18:46,243 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-02-19 15:18:46,422 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-02-19 15:18:46,567 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-02-19 15:18:46,683 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-02-19 15:18:46,691 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Map only job, skipping reducer estimation
2014-02-19 15:18:46,819 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
........................
........................

2014-02-19 15:18:51,518 [Thread-4] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
2014-02-19 15:18:51,953 [main] WARN  org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0001
2014-02-19 15:18:51,974 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-02-19 15:18:51,974 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2014-02-19 15:18:52,002 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
1.0.3	0.12.0	hduser	2014-02-19 15:18:46	2014-02-19 15:18:51	UNKNOWN

Success!

Job Stats (time in seconds):
JobId	Alias	Feature	Outputs
job_local_0001	youTube	MAP_ONLY	file:/tmp/temp950919397/tmp-2040316079,

Input(s):
Successfully read records from: "/home/hduser/pig_data/utube_mod1.csv"

Output(s):
Successfully stored records in: "file:/tmp/temp950919397/tmp-2040316079"

Job DAG:
job_local_0001

2014-02-19 15:18:52,029 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-02-19 15:18:52,043 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2014-02-19 15:18:52,059 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-02-19 15:18:52,062 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(D6frFp-VwHs,yetube,821,Entertainment,30,554455,3.54,2813,422)
(0Lg4i2C6zws,TNAwrestling,821,Sports,573,191461,4.46,217,111)
(UJpgxqYGws4,jrc0803,820,Sports,55,160852,4.09,486,423)
(BAPwg5nCKxE,milanoss,820,Film & Animation,578,170536,4.06,82,91)
(vVJ06ixj19Q,PimpimusPrime,820,Film & Animation,29,95950,4.61,134,185)
(UbhEunreGwQ,milanoss,820,Film & Animation,525,113422,4.06,48,44)
(sR2n3_fg-bY,koushibom,821,News & Politics,94,70136,3.79,38,36)
(ZB-MtI2sgP4,hotelcalifornians,820,Autos & Vehicles,77,77178,4.13,142,148)
(n-cLsNrL6W8,ganggeneral,820,Music,200,72386,3.55,279,279)
(KsL1F4HFxv0,deej240z,821,Entertainment,17,39270,4.51,41,60)
..........

# List the YouTube videos having 'rate' greater than 3:
grunt> rate_more_than_three = FILTER youTube BY (float) rate > 3.0;
2014-02-19 15:31:22,714 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).

grunt> DUMP rate_more_than_three;
................
(D6frFp-VwHs,yetube,821,Entertainment,30,554455,3.54,2813,422)
(0Lg4i2C6zws,TNAwrestling,821,Sports,573,191461,4.46,217,111)
(UJpgxqYGws4,jrc0803,820,Sports,55,160852,4.09,486,423)
(BAPwg5nCKxE,milanoss,820,Film & Animation,578,170536,4.06,82,91)
(vVJ06ixj19Q,PimpimusPrime,820,Film & Animation,29,95950,4.61,134,185)
(UbhEunreGwQ,milanoss,820,Film & Animation,525,113422,4.06,48,44)
(sR2n3_fg-bY,koushibom,821,News & Politics,94,70136,3.79,38,36)
(ZB-MtI2sgP4,hotelcalifornians,820,Autos & Vehicles,77,77178,4.13,142,148)
..........................
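The FILTER plus cast can be mimicked in Python to see why the cast is needed: the fields arrive as strings, so a numeric comparison only works after conversion. The rows below are sample tuples from the dataset (the second one, with rate 0, is there to show a row that gets filtered out):

```python
# Sample rows from the dataset: all fields are strings, as loaded by PigStorage.
rows = [
    ("D6frFp-VwHs", "yetube", "821", "Entertainment", "30", "554455", "3.54", "2813", "422"),
    ("s2ymS4fmjGQ", "Padovarulezcom", "821", "Pets & Animals", "88", "10925", "0", "0", "0"),
]

RATE = 6  # position of the 'rate' field in each tuple

# Equivalent of: FILTER youTube BY (float) rate > 3.0
rate_more_than_three = [r for r in rows if float(r[RATE]) > 3.0]

print(len(rate_more_than_three))  # only the first row survives the filter
```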

#Storing output to file:

grunt> store rate_more_than_three into '/home/hduser/rate_more_than_three';
..............
Success!

Job Stats (time in seconds):
JobId	Alias	Feature	Outputs
job_local_0003	rate_more_than_three,youTube	MAP_ONLY	/home/hduser/rate_more_than_three,

Input(s):
Successfully read records from: "/home/hduser/pig_data/utube_mod1.csv"

Output(s):
Successfully stored records in: "/home/hduser/rate_more_than_three"
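A rough Python sketch of what STORE does with the filtered relation (Pig actually writes a directory of part files via a MapReduce job; the part-file name and tab delimiter below match PigStorage defaults, but the paths here are temporary stand-ins):

```python
import os
import tempfile

# Sample filtered rows to persist, mirroring the STORE statement above.
rate_more_than_three = [
    ("D6frFp-VwHs", "yetube", "821", "Entertainment", "30", "554455", "3.54", "2813", "422"),
    ("0Lg4i2C6zws", "TNAwrestling", "821", "Sports", "573", "191461", "4.46", "217", "111"),
]

# STORE writes its output into a directory of part files; PigStorage
# separates fields with a tab by default. Write one part file the same way.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "part-m-00000")
with open(out_path, "w") as f:
    for row in rate_more_than_three:
        f.write("\t".join(row) + "\n")

with open(out_path) as f:
    print(f.read())
```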

#YouTube videos with length >= 500 AND <= 1000:
grunt> length_between_500_1000 = FILTER youTube BY length >= 500 AND length <= 1000;
grunt> DUMP length_between_500_1000;
................
(0Lg4i2C6zws,TNAwrestling,821,Sports,573,191461,4.46,217,111)
(BAPwg5nCKxE,milanoss,820,Film & Animation,578,170536,4.06,82,91)
(UbhEunreGwQ,milanoss,820,Film & Animation,525,113422,4.06,48,44)
(V__TtNHKXLU,Orkunyk,821,Film & Animation,599,27399,4.87,104,31)
(_IDfKKWBEZk,pundital,820,News & Politics,573,40935,4.76,155,215)
(hOgvS9c5Kz0,Orkunyk,821,Film & Animation,535,20141,5,74,16)
(doKkOSMaTk4,berkeleyguy0,821,News & Politics,585,20025,4.85,229,315)
(H23vitezN2E,vinnicamara,821,Entertainment,542,19689,4.7,74,85)
(fCDZnp4Pv4g,SoftAnime,820,Film & Animation,544,18110,4.44,50,30)
(koI4vN8Qosw,vinnicamara,821,Entertainment,533,17071,4.77,62,31)
(KY7MdPMyuhA,chriseliterbd,820,Entertainment,558,14845,4.99,130,27)
(RoFq0Be-6q0,DiziTube,821,Entertainment,596,10532,4.96,142,16)
(frvIaUVPN6I,TNAwrestling,820,Sports,894,123172,4.5,159,69)

#ILLUSTRATE: The ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements. It lets you test your programs on small samples of the data and get faster turnaround times.
grunt> illustrate length_between_500_1000;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| youTube     | video_id:bytearray    | uploader:bytearray    | age:bytearray    | category:bytearray    | length:bytearray    | views:bytearray    | rate:bytearray    | ratings:bytearray    | comments:bytearray    | 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|             | M2aZoFm4RVI           | brettkeane            | 821              | People & Blogs        | 693                 | 951                | 4.07              | 183                  | 75                    | 
|             | 53JV_3QR8f4           | itslate2              | 820              | Entertainment         | 369                 | 9549               | 4.85              | 130                  | 111                   | 
|             | hxHjWYA50Ds           | Politicstv            | 819              | News & Politics       | 1204                | 36725              | 4.99              | 290                  | 133                   | 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| length_between_500_1000     | video_id:bytearray    | uploader:bytearray    | age:bytearray    | category:bytearray    | length:bytearray    | views:bytearray    | rate:bytearray    | ratings:bytearray    | comments:bytearray    | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                             | M2aZoFm4RVI           | brettkeane            | 821              | People & Blogs        | 693                 | 951                | 4.07              | 183                  | 75                    | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

#GROUP BY:
grunt> grouped_by_category = group youTube by category;
grunt> dump grouped_by_category;

( UNA ,{(O7pBhfJnoYw,oodashi,0, UNA ,276,69303,4.06,52,36)})
(Music,{(Cvu373qjj7A,NeedToBe2007,820,Music,304,14330,4,21,21),(i5QF8-JmVm8,TIinstoresjuly3rd,820,Music,307,27636,4.71,163,134),(HXCWXLvJumo,TheKlasix,820,Music,164,20410,4.69,13,25),(jdkyugXj9Pg,brandt4797,820,Music,200,31671,4.54,172,141),(EKXR3zmpnvY,TIvsTIP,821,Music,32,11827,4.82,11,15),(gbDD5HLwifY,myhumpsfergie123456,819,Music,304,384435,3.6,1227,962),(pZnWxzFMCHg,AvrilLavigneSucks,819,Music,304,226893,4.51,579,465),(b2N27e79EdY,VeGaS1004,821,Music,564,7348,4.71,35,119),(uZtjiITa8FE,AllHipHopcom,820,Music,191,55069,3.34,119,127),(NnSk9zv6M6w,BuyCurtisJune26th,819,Music,189,88931,4.06,272,374),(lyfkVooQnmo,alanyukfoo,820,Music,197,11571,5,21,12),(h8y1qL_HLQE,Tam1r,820,Music,201,122334,4.68,344,324),(SKovbXL7xSI,edenmaze,821,Music,265,207,4.32,19,79),(T_SDHfswB30,davidchoimusic,821,Music,139,2694,4.77,192,105),(n-cLsNrL6W8,ganggeneral,820,Music,200,72386,3.55,279,279),(amH3qAu9RA4,blink182this,820,Music,135,17376,4.74,115,179),(X0E6Sf2vOwI,toprank144,821,Music,249,649,4.89,112,8),(oIfEw4e-0Yg,sleepingbeautywood,820,Music,223,57706,4.68,579,359),(IBM8VOpC6Mg,Brookers,820,Music,140,17018,4.31,548,348),(_7KUs1TBkCc,hollywoodrecords,820,Music,199,28640,4.55,65,31),(Gp8WFljRHsQ,ysabellabrave,821,Music,223,17005,4.58,697,344)})
(Comedy,{(RGc-F_0Y7tM,infrastructurdeep,820,Comedy,254,313693,4.38,1571,1626),(6l7P_dZQtac,jarts,820,Comedy,17,287000,4.67,466,304),(JqOeDW5wu_E,earthbrightfuture,819,Comedy,15,235198,3.59,343,572),(Ryeyee-4tw4,nalts,822,Comedy,316,222,4.66,148,133),(yDJAIT6Y5dI,correoyahoo,821,Comedy,34,33895,3.93,44,34),(Y69y0LSgMxk,mrpregnant,822,Comedy,183,1342,4.27,75,346),(8NPwJu42aUo,moonkeyz,819,Comedy,32,174327,1.99,1561,2243),(RzkEsSHOxQU,marcmaron101,820,Comedy,280,73232,4.52,607,101),(mY5qJHZCz2I,DraX360,818,Comedy,88,160177,4.79,358,174),(QBqkPAtKQN8,THANOS43,822,Comedy,246,681,4.62,8,81),(D3Lr70lwaVg,rawful,820,Comedy,25,13313,4.83,103,42),(AVnvjE9ixZs,potlot14,821,Comedy,583,1377,4.91,100,30),(HkRUHugg8vU,jrc0803,819,Comedy,107,131531,3.72,990,1148),(Y9NgXIkyiwk,AgentXPQ,822,Comedy,108,1802,4.73,139,95),(ZcTFgT3Ufw8,nalts,820,Comedy,229,14892,4.76,691,407),(H2XgxvuB2zY,brettkeane,822,Comedy,220,430,3.08,105,82),(bOYsI7f90R0,galipoka,821,Comedy,77,2389,4.54,161,160),(WFqqPK7kN0c,morbeck,821,Comedy,210,624,4.27,62,84),(lx9WAc29kHA,MasakoX,822,Comedy,251,3496,4.9,383,379),(Ub9JGlr9ok4,KnoxsKorner1,820,Comedy,379,1134,4.81,90,96),(2shafTX1KF8,TheRealParis,819,Comedy,135,19030,4.31,487,428),(cMKvJRRgAVY,MMASK,821,Comedy,110,566,4.94,100,6),(o7zDjdsOasg,brettkeane,821,Comedy,675,712,4.25,157,57),(5ml65ZRBVVc,selectscreen,821,Comedy,77,975,4.89,224,9),(P0TCzE4HTZQ,HappySlip,822,Comedy,150,8911,4.74,663,416),(Dni7fBgm_iA,swiftkaratechop,820,Comedy,170,1135,4.89,64,128),(rmeMGbJvu9E,JamesNintendoNerd,822,Comedy,54,6735,4.81,274,135),(Rl3rEL8AlbI,Brookers,820,Comedy,70,15119,4.05,465,347),(E22gUUTG2VI,MarkDayComedy,821,Comedy,246,6799,4.84,946,148),(N4yfFAIR9kc,beebee890,820,Comedy,22,8209,2.07,69,97),(jf8zyO8UrLc,khayav,821,Comedy,34,1121,4.63,87,115),(R1lyKwlvkns,freemovies125,820,Comedy,98,28624,4.54,82,20),(4kHKSZvJscE,lonelygirll15,822,Comedy,186,202,4.98,112,6),(i8K4WFTzjtE,potlot14,821,Comedy,474,1290,4.92,113,37),(0Cbzow5FsKQ,TheWoodcreekFaction,821,Comedy,106,1365,4.45
,121,90),(XAs68-oHKhA,guywiththeglasses,821,Comedy,22,2518,4.56,151,145),(73om6gY8XSA,nogoodtv,820,Comedy,442,644674,2.55,793,311),(pZiQBoG8K8E,nalts,821,Comedy,210,56230,3.84,887,469)})
(Sports,{(UJpgxqYGws4,jrc0803,820,Sports,55,160852,4.09,486,423),(pzgEJh6stMA,NBA,821,Sports,118,34627,4.34,61,57),(dh7Xxxhhr_8,NBA,821,Sports,126,25750,4.56,62,53),(UoAXg2pOG5s,jukimol,820,Sports,230,9010,3.7,70,91),(kaAdnLOuoAY,europeansportservice,820,Sports,333,10301,4.8,25,13),(LBBPoyXEOhM,maxpower1453,821,Sports,54,8927,4.76,504,169),(0Lg4i2C6zws,TNAwrestling,821,Sports,573,191461,4.46,217,111),(s7ZMRZCW9Xc,HockeyPacific,820,Sports,44,49214,2.98,141,591),(BgtJwf4dK1A,empiricalred,815,Sports,600,27387,4.88,283,127),(foquonPZSN8,BgirL5,819,Sports,45,48810,3.23,183,1097),(frvIaUVPN6I,TNAwrestling,820,Sports,894,123172,4.5,159,69),(wWrEEKNjIAY,TNAwrestling,821,Sports,223,11547,4.54,59,42),(mjmbJm7idkI,NBA,821,Sports,24,21217,4.59,32,71),(CgLYZHN78Kk,ProductZero,820,Sports,17,73897,4.6,239,358),(yDbX8I202VU,medakaschool,820,Sports,74,529826,4.53,1347,1145)})
(Howto & DIY,{(0S1nAcs772s,hortononon,819,Howto & DIY,46,119519,1.72,756,617),(TWTyJOsMmio,pigslop,822,Howto & DIY,468,1010,0,0,171),(xGcNpfT2L0s,diethealth,820,Howto & DIY,221,115003,2.74,91,82),(80TAx2mCnNc,oodashi,818,Howto & DIY,481,68664,4.15,26,14),(05oT4ejpYGQ,FreePSP,820,Howto & DIY,36,3533,1.67,18,221),(2IAEu6GtDWM,mikeskehan,821,Howto & DIY,140,2858,3.01,142,473)})
(Entertainment,{(IEWBO8xZzpo,DiziTube,821,Entertainment,596,3953,4.94,87,9),(5OaFXetu46Q,NBC,821,Entertainment,266,2452,4.68,92,19),(Sc33LPbqdl8,R3NDI3R,821,Entertainment,337,1782,3.15,140,170),(o_sz0NvQfLc,jakedeherrera,821,Entertainment,111,1004,4.89,81,59),(UbD7MG_j_UI,warren25smash,822,Entertainment,178,2252,3.15,200,202),(ut-RQaJum1c,DoctorArzt,820,Entertainment,29,30109,4.95,38,47),(iXT2E9Ccc8A,kembrew,819,Entertainment,561,46150,4.88,329,143),(MHor7QwhLY0,leorai,821,Entertainment,40,25229,2.75,8,7),(JX7zrbrnaOI,NOWWUT,821,Entertainment,59,19791,3.31,32,69),(s7uXvqfQvNI,TygerClaw,821,Entertainment,152,39929,4.87,180,188),(kxvT_F8GYV4,DoctorArzt,820,Entertainment,29,18794,4.44,9,8),(_idZr95SEAM,lostpromos,820,Entertainment,30,20441,5,24,12),(FXZBLli0_sA,GayGod,820,Entertainment,205,28006,3.53,337,575),(fu_GNJwUnxM,DiziTube,821,Entertainment,393,15614,4.92,354,127),(sYv8vzKKDVs,GWrocks09,820,Entertainment,98,18912,4.87,63,125),(JNiGret7EW4,aat08,821,Entertainment,52,29449,2,50,52),(_Z1X9zpBe_A,DharmaSecrets,820,Entertainment,30,16757,4.81,16,12),(DHDCITa7RyA,GWrocks09,820,Entertainment,153,17171,4.36,36,140),(KsL1F4HFxv0,deej240z,821,Entertainment,17,39270,4.51,41,60),(TJolUxvL3sQ,chriseliterbd,820,Entertainment,369,18302,5,125,18),(tXinnBzRSzg,txvoodoo,820,Entertainment,22,13673,4.5,16,19),(A5d-7KFINqM,vinnicamara,821,Entertainment,459,19187,4.84,74,34),(yZDNwXle154,ootpmovievids,821,Entertainment,30,47213,4.81,58,29),(-4R-4Q7vppg,HockeyCrazyRrazy,820,Entertainment,156,13572,4.45,29,124),(eNQidcorW_g,chriseliterbd,820,Entertainment,424,18506,4.87,124,17),(DE2O9CtuU5o,sundancechannel,821,Entertainment,73,21788,4.41,73,124),(H23vitezN2E,vinnicamara,821,Entertainment,542,19689,4.7,74,85),(G3bz3ZVjcEU,TVGuy88,820,Entertainment,123,11920,3.55,31,185),(koI4vN8Qosw,vinnicamara,821,Entertainment,533,17071,4.77,62,31),(KY7MdPMyuhA,chriseliterbd,820,Entertainment,558,14845,4.99,130,27),(#NAME?,chriseliterbd,820,Entertainment,395,14772,5,118,20),(oJWaEPxfgW8,Danoramma,820
,Entertainment,133,11404,3.97,32,152),(785FtaTTezo,daisytree1,821,Entertainment,64,24954,3.15,54,43),(RoFq0Be-6q0,DiziTube,821,Entertainment,596,10532,4.96,142,16),(MxR99EdoZmE,chriseliterbd,820,Entertainment,183,14263,4.97,86,13),(o57wrk4mKvM,DharmaSecrets,820,Entertainment,29,10448,4.6,10,34),(UaxZ63N0rRk,elfactorx,820,Entertainment,140,14235,4.79,14,16),(6J2mr5zKgsk,vinnicamara,821,Entertainment,422,10164,4.88,58,50),(tKKLUIOpvSE,CelebTV,820,Entertainment,45,197144,1.87,587,221),(bOfAPHJTagY,BANGOUTCLIQUECOM,819,Entertainment,17,164822,4.68,638,928),(kMg0gGaMe0Q,tibermedia,819,Entertainment,235,187497,4.73,402,323),(izk3aYS9FIA,truefaithisle,820,Entertainment,61,138296,4.31,552,490),(H85ZaC7utus,sierraforest,816,Entertainment,27,96953,4.44,39,29),(20M8Kf0wqZg,yummyum07,819,Entertainment,304,91578,4.1,224,134),(bWJshfx9bTw,blahinsider,819,Entertainment,138,88262,3.93,190,289),(HiHdJ426_2k,goaltaker1,820,Entertainment,236,84088,4.46,240,291),(eQUAAwNJtg0,YTwatchdog,820,Entertainment,162,2391,4.64,133,568),(D6frFp-VwHs,yetube,821,Entertainment,30,554455,3.54,2813,422),(UAJctmZaLgY,tibermedia,818,Entertainment,407,87416,4.6,113,100),(KuYiFwTsFiw,DiziTube,821,Entertainment,599,10153,4.91,232,15),(UoubyKe9Xa0,DiziTube,821,Entertainment,599,9466,4.96,213,54),(_h-DUe3o4j4,peron75,820,Entertainment,280,5721,4.8,228,192),(RJIxWE2qDsk,DiziTube,821,Entertainment,599,9032,4.96,201,21),(mEDsATTpgGk,DiziTube,821,Entertainment,598,8823,4.99,182,20),(IuGyQRdPP5c,DiziTube,821,Entertainment,594,10151,4.95,183,16),(ZOQU1YP4SsE,DiziTube,821,Entertainment,597,10008,4.92,180,14),(boSrjupyOlw,peron75,821,Entertainment,345,1836,4.87,159,85),(53JV_3QR8f4,itslate2,820,Entertainment,369,9549,4.85,130,111),(scm9nRm8tMY,DeltaDJ2006,821,Entertainment,79,260,5,120,0),(IXFjHlaz1J0,DiGiTiLsOuL,821,Entertainment,303,1937,4.84,128,119),(rlom5CERakI,chriseliterbd,820,Entertainment,272,11151,4.97,110,11),(GwtkWMZbQHk,hoiitsroi,821,Entertainment,192,5451,4.56,126,156),(IT1rSaGmUDQ,chriseliterbd,820,En
tertainment,211,13355,4.96,103,28),(FMgKXKsZN90,WHATTHEBUCKSHOW,821,Entertainment,498,2571,4.78,110,99),(ShbEBBcvtBc,chriseliterbd,820,Entertainment,395,11941,4.91,101,7),(rayY8wvYD08,vinnicamara,821,Entertainment,584,9532,4.81,80,157)})
(People & Blogs,{(i18uFHYsUvo,brettkeane,820,People & Blogs,897,729,4.39,143,74),(icd4MgHPOno,brettkeane,821,People & Blogs,1141,827,4.28,149,72),(ZPoWU65NszY,billybigun64,821,People & Blogs,528,746,4.99,80,64),(#NAME?,nickynik,822,People & Blogs,394,397,4.84,186,102),(ZWFtVnqMFu8,Daxflame,821,People & Blogs,236,24769,3.44,882,1472),(caC_fGJT-SM,Blunty3000,821,People & Blogs,247,2664,3.97,152,143),(hEHMhMjlDek,ren4165,820,People & Blogs,183,410,4.56,9,86),(OHCcVlRsllc,smpfilms,821,People & Blogs,164,7430,4.29,219,342),(QW7pxFBBjaU,rickyste,820,People & Blogs,13,13375,4.59,403,620),(mkoh0eXnnf0,kicesie,821,People & Blogs,44,626,3.91,11,125),(C1vZzxyVhV8,Zipster08,821,People & Blogs,245,2072,4.64,163,132),(SD--8k_IsQo,SoldierInGodsArmy,821,People & Blogs,473,480,4.85,141,71),(NkqbDeuKXNk,goldengun85,821,People & Blogs,372,622,1.94,48,96),(#NAME?,blacktreemedia,821,People & Blogs,129,51705,4.62,99,173),(8TwvvOC8vdU,theboringdispatcher,821,People & Blogs,301,1201,4.28,68,80),(NUqEVO_C5ss,applemilk1988,820,People & Blogs,279,35025,4,430,823),(bp_TNrl8xc4,brettkeane,821,People & Blogs,364,464,4.49,99,37),(iaU9puwzMkE,hoiitsroi,821,People & Blogs,420,4726,4.72,127,230),(FiSddAJNIoc,brettkeane,821,People & Blogs,903,636,4.44,147,41),(P4im8gGPdGM,karpmax,821,People & Blogs,62,19973,4.18,11,15),(c5-TCNHSPkk,YourTubeNEWS,821,People & Blogs,341,687,3.11,114,77),(M2aZoFm4RVI,brettkeane,821,People & Blogs,693,951,4.07,183,75),(HAt8hmTNVbY,biostudentgirl,819,People & Blogs,160,124923,4.05,633,312),(O6_oXxTWHmo,mushcul,817,People & Blogs,83,52239,4.45,106,514),(CN-rHMWlB4w,soccerstar4ever,820,People & Blogs,192,45006,2.95,187,219),(c2wkbdBprDw,ashleytisdale,821,People & Blogs,44,24717,4.55,446,434),(6P1IR84LwI4,hydroax,820,People & Blogs,15,18533,1.8,5,4),(CySrshUMwIw,blacktreemedia,821,People & Blogs,242,78879,4.08,216,349),(IvpaBrX52pM,MissMalena13,821,People & Blogs,177,1707,4.95,387,21),(hoVr6iIKj_c,communitychannel,822,People & 
Blogs,206,7771,4.78,362,327),(7mMw9TdZLxI,khriskhaos2,821,People & Blogs,0,1043,0,0,114),(ilt0rxr5gQk,dramatubearchive,820,People & Blogs,110,3617,3.64,45,812),(T9FJIYBWdTQ,zakgeorge21,822,People & Blogs,638,590,4.75,83,107),(Gs87lI02tek,renetto,820,People & Blogs,646,8516,4.36,448,650),(TCMpVv87g6E,TheAmazingAtheist,821,People & Blogs,913,2805,3.79,486,168),(ezgk4QXwIhY,xgobobeanx,821,People & Blogs,112,1075,4.61,88,95),(E2YTdBsEnro,spricket24,821,People & Blogs,578,3676,4.59,261,228)})
(Pets & Animals,{(T7NpsCWvjzg,HellionExciter,821,Pets & Animals,289,1670,4.21,103,128),(s2ymS4fmjGQ,Padovarulezcom,821,Pets & Animals,88,10925,0,0,0)})
(Gadgets & Games,{(tOF8-Z7yQ_U,joystiq,820,Gadgets & Games,88,10991,4.47,30,25),(jDdtzpmVb1U,UrinatingTree,821,Gadgets & Games,424,1764,4.75,89,0),(s8hRgb0WTt4,ostekakepstsnet,821,Gadgets & Games,415,1135,4.84,108,39),(w867ePtiaZI,UrinatingTree,821,Gadgets & Games,546,1669,4.79,95,93),(xTkLDGtu36Q,Marriland,821,Gadgets & Games,348,2214,4.93,70,131),(Wq3laSZAT7U,spritefan2,821,Gadgets & Games,180,800,4.64,44,98),(R07C3wfft_4,HotPinkMidNite,820,Gadgets & Games,104,1278,4.92,84,83)})
(News & Politics,{(3UrumrPlFiU,cpotato2004,815,News & Politics,605,2931,3.98,62,409),(sR2n3_fg-bY,koushibom,821,News & Politics,94,70136,3.79,38,36),(xcQQ05XtAQ4,RonPaul2008dotcom,821,News & Politics,295,47309,4.89,1263,736),(CY86R1qjgDc,CBS,821,News & Politics,123,61187,3.8,122,12),(shoyObDet4Q,lbracci,820,News & Politics,1529,24533,4.24,38,67),(_IDfKKWBEZk,pundital,820,News & Politics,573,40935,4.76,155,215),(rF3NtEWj6ws,VoteRonPaul08,820,News & Politics,363,24460,4.9,343,292),(R0HEKTr6wrc,tpmtv,821,News & Politics,67,27474,4.61,84,90),(j_qUvgfzuPM,bgcaplay,820,News & Politics,312,14986,2.57,177,78),(ceIXPrfuGxg,puratrampa,820,News & Politics,110,16138,2.29,7,7),(doKkOSMaTk4,berkeleyguy0,821,News & Politics,585,20025,4.85,229,315),(T_VC8iH7lXQ,serkanserkanserkannn,821,News & Politics,40,13052,3.4,5,2),(YPbr5L4ByfE,koushibom,821,News & Politics,415,15045,3.33,3,11),(3FV7XU-TLMU,hillaryclintondotcom,820,News & Politics,53,277309,2.8,2038,50),(vJRDZE5xW2Y,ecogeeky,816,News & Politics,114,169722,4.71,17,15),(YkAPaEMwyKU,TRUEADONIS,819,News & Politics,292,166576,4.76,1423,1368),(Sy4Eugc0Xls,karlspackler,820,News & Politics,363,87170,4.89,1723,1408),(J8oO_OD3PtI,pumaman1,820,News & Politics,472,7942,4.86,326,177),(D6SfmXigHpE,suntereo,821,News & Politics,601,19941,4.87,285,284),(MvID-e_irz4,warren25smash,821,News & Politics,314,1743,4.52,220,166),(PvrrPCkHKLw,aravoth,821,News & Politics,237,8010,4.93,136,60),(LMO3Cg0frB8,brettkeane,821,News & Politics,808,474,4.2,117,47),(Q5VeaUW12pY,MiddleClassified,820,News & Politics,476,14571,4.98,412,179),(hxHjWYA50Ds,Politicstv,819,News & Politics,1204,36725,4.99,290,133),(DTHOgoGBUkw,michellemalkin,821,News & Politics,302,4655,4.14,74,187),(Y_r8mMCsSHA,onedeaddj,821,News & Politics,312,3663,1.79,136,177),(Yj1wext0CQc,ResurrectionOfCG,821,News & Politics,391,370,3.88,43,94),(AF_frpUoMIg,thereaganite84,820,News & Politics,31,289,5,10,85),(xP_2M5DDAuM,joshallem,820,News & 
Politics,583,495,0,0,83),(sk334TbliaY,GalacticCabaret,819,News & Politics,169,10138,4.91,141,645),(Hc1ohELwjWo,lzpoint1944,819,News & Politics,60,23672,2.85,177,463),(KNz0pta4PVU,VoteRonPaul08,819,News & Politics,301,24500,4.81,270,435)})
(Travel & Places,{(vPtsS4UTuis,sxephil,821,Travel & Places,260,12642,4.02,119,86)})
(Autos & Vehicles,{(ZB-MtI2sgP4,hotelcalifornians,820,Autos & Vehicles,77,77178,4.13,142,148),(DjPfzKbYDXw,lampesuda,818,Autos & Vehicles,48,83313,4.31,77,117),(9OLYrVcjmzM,casperjello,820,Autos & Vehicles,2,21754,4,37,25),(po15iWrv9zs,tellible,820,Autos & Vehicles,141,30703,4.17,100,621),(c0QQ3mqJEIk,jufanet,820,Autos & Vehicles,156,38698,3.66,110,485),(9GCUEamZRSs,casperjello,820,Autos & Vehicles,2,16866,4.11,27,23),(rCvifZybyLg,autoknipsfix,821,Autos & Vehicles,91,36347,4.2,84,82)})
(Film & Animation,{(#NAME?,ChappiRukia,821,Film & Animation,232,1487,4.87,163,96),(nlWOu9FHm-I,truefaithisle,820,Film & Animation,250,212185,4.62,1154,1713),(QdTZQkfkG0Q,animeswordns14s1,821,Film & Animation,252,18277,4.87,94,13),(fCDZnp4Pv4g,SoftAnime,820,Film & Animation,544,18110,4.44,50,30),(v74fPO89L0U,Orkunyk,821,Film & Animation,398,12588,4.93,55,12),(QsLKiPX9Kn4,Orkunyk,821,Film & Animation,389,13732,4.84,74,12),(6vzQ7fCTqzU,Orkunyk,821,Film & Animation,406,13990,4.92,49,8),(N0p17EQaUsA,Orkunyk,821,Film & Animation,457,20447,4.93,68,12),(QSeKvw7KnN4,Orkunyk,821,Film & Animation,159,14780,4.79,80,52),(vVJ06ixj19Q,PimpimusPrime,820,Film & Animation,29,95950,4.61,134,185),(D2FXCczEYCI,Orkunyk,821,Film & Animation,398,21848,4.91,66,16),(wBbQKSvQ5h4,Orkunyk,821,Film & Animation,380,16006,4.98,61,7),(_S2ODLHD_xk,Orkunyk,821,Film & Animation,359,22643,4.97,68,10),(BAPwg5nCKxE,milanoss,820,Film & Animation,578,170536,4.06,82,91),(MYSuY2UIj9U,Orkunyk,821,Film & Animation,419,16339,4.92,64,6),(CndugLgU02U,Orkunyk,821,Film & Animation,447,24484,4.85,86,14),(8OkkJEt52HM,bountyTR,821,Film & Animation,26,19644,4.86,7,13),(hOgvS9c5Kz0,Orkunyk,821,Film & Animation,535,20141,5,74,16),(V__TtNHKXLU,Orkunyk,821,Film & Animation,599,27399,4.87,104,31),(S84s5OQIKg8,macpulenta,819,Film & Animation,235,24248,4.89,432,164),(UbhEunreGwQ,milanoss,820,Film & Animation,525,113422,4.06,48,44),(XKwaYhOnwRg,cankiriklari,821,Film & Animation,600,4427,4.96,80,14),(UWLYfu04-RM,OneMJDN,821,Film & Animation,600,479,4.94,90,33),(Iy3MkWs1-PI,Samsunlu1,820,Film & Animation,618,4947,4.95,91,13),(PNQwOuTpgLM,Samsunlu1,820,Film & Animation,617,4005,4.9,97,11),(TIs5FiOG2ho,Samsunlu1,820,Film & Animation,615,4551,4.91,104,7)})
grunt>
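The GROUP BY output above pairs each category with the bag of full tuples sharing that category. A minimal Python sketch of the same idea, using a few abbreviated rows from the dataset (only the first four fields are kept here for brevity):

```python
from collections import defaultdict

# A few (video_id, uploader, age, category) tuples from the dataset above.
rows = [
    ("vPtsS4UTuis", "sxephil", "821", "Travel & Places"),
    ("T7NpsCWvjzg", "HellionExciter", "821", "Pets & Animals"),
    ("s2ymS4fmjGQ", "Padovarulezcom", "821", "Pets & Animals"),
]

CATEGORY = 3  # position of the 'category' field

# Equivalent of: group youTube by category -- each key maps to the bag
# of tuples that share that category, just like the dump output above.
grouped_by_category = defaultdict(list)
for row in rows:
    grouped_by_category[row[CATEGORY]].append(row)

print(len(grouped_by_category["Pets & Animals"]))  # two videos in this group
```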