A thread is a sequence of instructions that can be executed independently by a computer's central processing unit (CPU); it is a lightweight unit of execution within a process.
Processes, in turn, are instances of programs running on a computer. Each process has its own memory space and resources, and multiple processes can run concurrently on a computer system. Threads share the same memory space and resources within a process.
Fig: Relation between Thread and Process
Thread programming is the technique of creating and managing multiple threads of execution that run concurrently within a single process. In cloud programming, threads are used to achieve concurrency and parallelism, improving the performance, scalability, and responsiveness of cloud-based applications. Thread programming enables different parts of an application to execute simultaneously, allowing efficient utilization of compute resources and faster completion of tasks.
A thread does not maintain a list of created threads, nor does it know the thread that created it. All threads within a process share the same address space. Threads in the same process share the process's instructions and data, descriptors, signals and signal handlers, current working directory, and user and group IDs. Each thread has its own thread ID, register set, stack pointer, stack for local variables and return addresses, signal mask, priority, and return value.
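As a minimal illustrative sketch (the class and variable names are made up for this example), the following Java program starts two threads in one process: both see the same sharedCounter field, while each keeps its own local variables on its own stack:

    public class SharedMemoryDemo {
        // Shared: every thread in this process sees the same static field.
        static int sharedCounter = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                // Local: each thread gets its own copy of this variable on its own stack.
                int localSteps = 0;
                for (int i = 0; i < 1000; i++) {
                    sharedCounter++;   // unsynchronized update of shared state (a data race)
                    localSteps++;
                }
                System.out.println(Thread.currentThread().getName() + " did " + localSteps + " steps");
            };

            Thread t1 = new Thread(work, "worker-1");
            Thread t2 = new Thread(work, "worker-2");
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            // Often prints less than 2000, because the two threads race on sharedCounter.
            System.out.println("sharedCounter = " + sharedCounter);
        }
    }

The lost updates in the final count are exactly why the synchronization mechanisms discussed below are needed.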
Multithreading
Multithreading is a capability or execution model where a single process is divided into multiple threads that can run simultaneously (concurrently or in parallel). It is a way to introduce parallelism in the system or program.
Uses
Processing large datasets that can be divided into parts and handled by multiple threads.
Applications that involve mechanisms such as validate-and-save, produce-and-consume, or read-and-validate run these steps in multiple threads, e.g., online banking and mobile recharges.
Games, where different elements execute on different threads.
In web applications, threads are used to perform asynchronous calls.
In Android, API calls are made on a background thread so that the main (UI) thread does not block and the app does not stop responding.
Advantages
Economical: threads share their process's resources, and creating a thread takes less time than creating a process.
Resource sharing: threads share resources such as data, memory, and files, so an application can have multiple threads within the same address space.
Responsiveness: an application can remain responsive to input even while part of it is blocked or busy.
Scalability: increases parallelism on multi-CPU machines.
Enhances the performance of multiprocessor machines.
Improves CPU utilization.
How is thread programming done? (Lifecycle of Threads)
Thread programming involves creating and managing multiple threads within a program to achieve concurrent execution of tasks. In general, thread programming is done as follows:
Thread Creation
Threads are typically created through the operating system's thread API or a thread-management library provided by the programming language.
Thread Synchronization
Threads access shared resources, so thread synchronization mechanisms are required to prevent conflicts and ensure data integrity. Common synchronization techniques include locks, mutexes, semaphores, and condition variables.
Thread Execution
The scheduler determines the order and duration of execution for each thread based on factors such as priority, time slicing, and the availability of resources.
Thread Communication
Threads can communicate through mechanisms such as shared memory, message passing, etc.
Thread Termination
Threads can be terminated explicitly by the program or automatically by the operating system when they have completed their tasks.
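Putting these steps together, here is a minimal Java sketch (the class and variable names are illustrative) that walks through the lifecycle: two threads are created, they synchronize on a shared lock while updating shared state, the scheduler interleaves their execution, and they terminate once the main thread joins them:

    import java.util.concurrent.locks.ReentrantLock;

    public class LifecycleDemo {
        static int balance = 0;
        static final ReentrantLock lock = new ReentrantLock();

        public static void main(String[] args) throws InterruptedException {
            Runnable deposit = () -> {
                for (int i = 0; i < 1000; i++) {
                    lock.lock();            // synchronization: one thread in the critical section at a time
                    try {
                        balance++;          // communication through shared memory
                    } finally {
                        lock.unlock();
                    }
                }
            };

            Thread t1 = new Thread(deposit);   // creation
            Thread t2 = new Thread(deposit);
            t1.start();                        // execution: the scheduler decides when each thread runs
            t2.start();
            t1.join();                         // termination: wait for both threads to finish
            t2.join();
            System.out.println("balance = " + balance);  // always 2000, thanks to the lock
        }
    }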
Task Programming
Task programming is an approach to parallel programming where tasks, rather than threads, are the fundamental unit of work. Tasks represent individual units of work that can be executed independently and concurrently, allowing for efficient utilization of resources. In task programming, we typically define tasks and their dependencies, and a task scheduler manages the execution of these tasks. The scheduler automatically determines the order of task execution based on their dependencies and availability of resources.
A task identifies one or more operations that produce a distinct output and that can be isolated as a single logical unit. In practice, a task is represented as a distinct unit of code, or a program, that can be separated and executed in a remote runtime environment. Applications are seen as collections of independent or dependent tasks (e.g., data processing, API calls) that can run separately. Scheduling algorithms assign tasks to available virtual machines (VMs) or other resources so as to minimize completion time and cost.
The following are the key concepts and steps involved in task programming:
Task Definition:
Tasks represent discrete units of work that can be executed independently. Tasks can be defined using language-specific constructs or libraries that provide task parallelism support.
Task Dependencies:
Tasks can depend on other tasks; a task that needs the output of another task must wait for that task to complete before it can start executing.
Task Scheduler:
The task scheduler is responsible for managing the execution of tasks. It analyzes the task dependencies and schedules the tasks for execution.
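As a minimal sketch, Java's CompletableFuture is one library that supports this model (the task bodies below are placeholders): two independent tasks run concurrently, and a third task that depends on both starts only once their outputs are available:

    import java.util.concurrent.CompletableFuture;

    public class TaskDemo {
        public static void main(String[] args) {
            // Two independent tasks; the scheduler may run them on different pool threads.
            CompletableFuture<Integer> fetchA = CompletableFuture.supplyAsync(() -> 20);
            CompletableFuture<Integer> fetchB = CompletableFuture.supplyAsync(() -> 22);

            // A dependent task: starts only after both fetchA and fetchB complete.
            CompletableFuture<Integer> combine = fetchA.thenCombine(fetchB, (a, b) -> a + b);

            System.out.println("result = " + combine.join());  // prints: result = 42
        }
    }

Note that the program declares what depends on what; it never creates or joins a thread itself, which is the point of the task model.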
Advantages:
No explicit threading: tasks are created, not threads.
The runtime schedules tasks onto the processors, so the code scales with the number of available processors.
Multiple algorithms written using the task programming model can run at the same time without significant performance impact.
Complex problems can be solved without requiring traditional synchronization tools such as mutexes and condition variables.
Task functions are simple to write.
Map-Reduce Programming
MapReduce programming is a model and framework that is used for processing and analyzing large-scale datasets in a distributed computing environment, typically in the cloud. It was first introduced by Google and has become popular for its ability to handle massive amounts of data efficiently.
A MapReduce job typically divides the input dataset into distinct pieces that are processed in parallel by the map tasks. The framework sorts the map outputs, which are subsequently fed into the reduce tasks. Typically, both the job's input and output are stored in a file system. The framework manages task scheduling, task monitoring, and re-execution of failed tasks.
In the MapReduce programming model, computations are divided into two main stages: the Map stage and the Reduce stage.
Map stage: In this stage, the input data is divided into smaller chunks and processed independently by multiple Map tasks. Each Map task applies a map function to the input data chunk and generates a set of intermediate key-value pairs.
Reduce stage: In this stage, the intermediate key-value pairs are processed by Reduce tasks. Each Reduce task applies a reduce function to the grouped intermediate data. The reduce function combines the values associated with each key, performing operations like aggregation, summarization, or generating the final output.
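In the original MapReduce formulation, the two user-supplied functions have the following type signatures; the framework takes care of the splitting, shuffling, and grouping in between:

    map(k1, v1)          -> list(k2, v2)    // emit intermediate key-value pairs
    reduce(k2, list(v2)) -> list(v2)        // combine all values for one key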
MapReduce is often used for large-scale data processing tasks like batch processing, log analysis, web indexing, and data transformations. It provides a scalable and fault-tolerant approach to big data processing. Various implementations of MapReduce exist, including Apache Hadoop and Apache Spark.
Advantages of MapReduce:
It is fault-tolerant; that is, it can handle failures by re-executing failed tasks.
Each node regularly sends its status update (heartbeat) to the master node.
MapReduce can process huge quantities of unstructured data, gigabytes at a time, in a matter of minutes.
It is a scalable approach to big-data processing.
Map Reduce Process
Fig: Map Reduce word count process (example)
Input:
The raw data stored in a distributed file system (e.g., HDFS) that is to be processed.
Splitting:
The input data is divided into smaller fixed-size chunks (input splits) so they can be processed in parallel.
Mapping:
Each split is processed by a mapper to generate intermediate key–value pairs.
Shuffling:
Intermediate key–value pairs are transferred, grouped, and routed so that all values with the same key go to the same reducer.
Reducing:
The reducer processes each key and its grouped values to produce aggregated results.
Final Result:
The output produced by reducers and written back to the distributed file system as the final processed data.
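The following self-contained Java sketch simulates these stages in a single process (the three input splits are hard-coded for illustration; a real framework would distribute them across nodes):

    import java.util.*;
    import java.util.stream.*;

    public class WordCountStages {
        public static void main(String[] args) {
            // Input + Splitting: three fixed-size chunks of the raw text.
            List<String> splits = List.of("deer bear river", "car car river", "deer car bear");

            // Mapping: each split independently emits (word, 1) pairs.
            List<Map.Entry<String, Integer>> mapped = splits.stream()
                .flatMap(split -> Arrays.stream(split.split(" ")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

            // Shuffling: group the intermediate pairs so all values for one key land together.
            Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                         Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

            // Reducing: sum the grouped values for each key.
            Map<String, Integer> result = new TreeMap<>();
            shuffled.forEach((word, ones) ->
                result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

            // Final result: {bear=2, car=3, deer=2, river=2}
            System.out.println(result);
        }
    }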
Parallel Efficiency of Map Reduce
Parallel efficiency of MapReduce refers to the effectiveness of utilizing computational resources and the speedup achieved when executing a MapReduce job on a distributed computing system. It measures how well the system scales and uses the available resources to perform the required computation in parallel.
Parallel efficiency = (actual speedup / number of processors) * 100
Here, speedup represents the improvement in performance achieved by parallel execution compared to a sequential execution. A parallel efficiency of 100% indicates that the parallel execution is utilizing all available resources effectively. However, in practice, achieving 100% is challenging.
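For example, if a job that takes 120 minutes sequentially finishes in 10 minutes on 16 processors, the speedup is 120/10 = 12, and the parallel efficiency is (12 / 16) * 100 = 75%.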
Let D be the original data size and assume the data produced after the map phase is σ times that, i.e., σD of post-map data. Further, assume there are P processors and that the useful work is W per data item, i.e., WD in total.
After the map operation, each mapper writes σD/P data to its local disk, giving a write overhead of σD/P per processor.
Next, this data has to be read by the reducers before the reduce operations can begin. Each reducer reads σD/P², i.e., one Pth of the data, from each particular mapper. Since there are P different mappers, the data a reducer fetches in total is
(σD/P²) × P = σD/P,
so the read overhead is also σD/P per processor.
With c denoting the per-item cost of reading or writing data, the efficiency of MapReduce is
E_MR = (WD/P) / (WD/P + 2cσD/P) = 1 / (1 + 2(c/W)σ).
Scalability: efficiency approaches 1 as the useful work per data item, W, grows, independently of P. As an example, consider counting word frequencies in n documents, each containing m distinct words that occur f times per document on average, so that D = nmf. The map phase produces mP partial counts (m per processor), so
σ = mP / (nmf) = P / (nf)
and
E_MR = 1 / (1 + 2cP/(Wnf)) = 1 / (1 + 2P/(nf)),
assuming W and c are the same. Hence, as the ratio n/P grows large, E_MR approaches 1, as expected for a scalable parallel algorithm.
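For example, with n = 1000 documents, f = 10 occurrences of each word per document, and P = 100 processors, 2P/(nf) = 200/10000 = 0.02, so E_MR = 1/1.02 ≈ 98%: nearly all of the processors' time goes into useful work.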
Enterprise batch processing using MapReduce
The data generated by today's enterprises has been increasing at exponential rates in size. Bioinformatics applications mine databases containing terabytes of data. The transaction data of an e-commerce site may exceed millions per month. This volume of data is mined not only for billing purposes but also to find events, trends, and patterns that help these firms provide better service.
With such a large data volume, it is difficult for a single server or node to handle it. MapReduce is a framework that lets users develop code that runs on numerous nodes without worrying about fault tolerance, dependability, synchronization, or availability. Batch processing is a type of automated job that performs its computation regularly, executing the processing code on top of a set of inputs known as a batch. The job will often read batch data from a database and save the results in the same or a separate database.
Batch processing operates on the accumulated transactions as a group; once batch processing has begun, no user participation is necessary. The batch processing mechanism processes data blocks that have been collected and stored over a period of time. Apache Hadoop's MapReduce is the most widely used batch processing framework. Data processing with Hadoop's MapReduce framework is shown below:
Fig: Data processing using Hadoop MapReduce
Batch processing is a way of accumulating work and performing it all periodically, such as at the end of a day, week, or month. Batch processing frameworks are well suited to exceedingly large datasets that require a substantial amount of computation.
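For reference, the classic WordCount job written against Hadoop's org.apache.hadoop.mapreduce API shows what such a batch job looks like in code: the mapper emits (word, 1) pairs, the reducer sums them, and the input and output paths are taken from the command line. The code must be compiled against the Hadoop client libraries and submitted to a cluster:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);   // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();  // aggregate counts per word
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }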