Big Data is proving transformational in every aspect of our lives and may well govern decision making at all levels: in government, business and society. Surveys suggest that every minute some 300,000 Facebook users log in, 200 million emails are sent, 2 million search queries are run and 100,000 tweets are posted, together generating a mind-boggling 200 exabytes of data. This does not count the equally staggering amounts of data generated by industrial sensors and by stock markets across the world. By 2020 the volume of data could well exceed 35 zettabytes. Almost all of this data is unstructured, semi-structured or raw.
Typical uses of big data are to analyze logs, whether from the web or from machines, to detect fraud, to track people's preferences and changing trends, and to feed risk models that help fine-tune strategies, lower costs and achieve better results. It is just as relevant in governance, energy and health care as it is in business.
Data on such a mammoth scale is simply beyond the scope of traditional software and hardware. This has led to the development of an entire ecosystem built specifically to handle, process and analyze raw, unstructured or structured big data, at stupendous cost: investments are estimated to exceed USD 120 billion by 2015. Big data technologies are capable of statistical analysis, voluminous storage, natural language processing and search. They handle extremely large data volume, variety and velocity with unsurpassed ease and deliver analytics that markedly influence the way business, retail, healthcare, government and IT segments function. People at the receiving end are affected too, since, for example, large enterprises can categorize customers and deliver according to their preferences.
Data is generated in massive volumes at astounding speeds each minute. Twitter, for example, generates 7 terabytes of data each day while Facebook generates 10 terabytes. Infrastructure needs the capability to ingest such voluminous data, process it, manage it, analyze it and deliver meaningful outputs almost instantaneously. Sensors in a wide variety of industrial devices generate a constant stream of data each second, and stock, currency and commodities markets across the world add impressive volumes of their own.
Data variety, availability and quality
Big data flows in a torrent of forms: social media data, email, documents and search data from the internet, along with sensor readings, in semi-structured, unstructured or raw form. This huge variety and volume is handled by sophisticated big data analytics tools that should, in theory, be able to "intelligently" analyze and categorize incoherent data into coherent results. In practice, the variety and the dynamically changing mass of data make quality data hard to come by, in which case the output leads to bad decision making and nullifies the effort and cost of big data analysis. Data does flow, but is it exactly the data required as the foundation for the right decisions? This is one of the challenges in big data.
Big Data velocity, veracity, quality and integration
Big data analysts can gloat over the fact that ample data flows in each second. However, it is the very velocity of the data that makes it difficult to zero in on the exact piece of data or to deliver up-to-the-minute analyzed output. Veracity is another confounding factor in big data processing and analytics. On the internet, for example, what people submit may or may not be the actual truth, while some vital, linked information may be missing, all of which affects the final outcome of analytics. Quality of data is a further issue in big data processing and management: if 90% of the data is junk, a great deal of time, money, energy and resources goes into sifting the wheat from the chaff. Combining data from diverse sources and locations and integrating it into a statistically analyzed report is itself an exercise prone to delivering skewed results. Advanced countries typically generate a higher proportion of data, whereas less developed countries with limited access to the internet and other technologies still rely on manual processes and do not deliver big data flows; in such cases, too, results are lopsided. And when it comes to analyzing demographics or people's preferences, the challenge is to extract meaningful information without intruding on privacy, an area where government intervention may become necessary.
Hardware and software infrastructure, people skills
An emerging technology, big data is not yet in widespread use because of its inherent software and infrastructure requirements. Hadoop is one of the technologies that addresses big data workloads, which may be spread over hundreds of servers to handle the variety, velocity and volume of data. Technology and intelligent algorithms apart, it is essentially people who oversee the entire process, especially analytics. Big data is different from traditional data and calls for a different mindset, whereas people may have an ingrained, preset way of looking at data and coloring it. A further challenge is the shortage of big data professionals. The technology is so new that relatively few people have the expertise or experience to truly parse and interpret big data, or to develop superior, more intelligent data-sifting and analysis tools. Ultimately people are in charge, and they face the imposing task of gaining insight into captured data, reconciling data from different sources and transforming it into analytics. There are some areas where only experience and maturity work, and this holds for big data too.
To sum up, the challenges in big data are:
- Acquiring and storing voluminous data
- Sifting through data and extracting useful information
- Aggregation and integration followed by representation
- Querying, data modeling and analysis
- Interpretation to give meaningful analytics
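The steps above can be sketched end to end in a few lines. This is an illustrative toy in Python, not a big data system: the log lines are invented sample data, but the sift, aggregate and interpret stages mirror the list.

```python
# Toy pipeline over hypothetical web-server log lines (invented data):
# acquire -> sift -> aggregate -> interpret.
from collections import Counter

raw_logs = [
    "2024-01-01 GET /home 200",
    "2024-01-01 GET /cart 500",
    "2024-01-02 GET /home 200",
    "2024-01-02 POST /cart 200",
]

# Sift: extract useful information (keep only successful requests)
useful = [line.split() for line in raw_logs if line.endswith("200")]

# Aggregate: count requests per path
per_path = Counter(fields[2] for fields in useful)

# Interpret: report the most requested path as a meaningful result
top_path, hits = per_path.most_common(1)[0]
print(top_path, hits)  # /home 2
```

At real scale each stage runs distributed over many machines, but the logical shape of the work is the same.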
How it is currently handled
A popular methodology for analyzing big data is to set up Hadoop and MapReduce alongside parallel databases, which brings in issues such as data transfer delays. Google Bigtable and Amazon Dynamo are also in use. A typical parallel database stack includes an SQL compiler, a relational dataflow layer and a row/column storage manager. A Hadoop software stack typically comprises HiveQL, Pig Latin or Jaql scripts, followed by the MapReduce dataflow layer, then a get/put operational layer, ending with the Hadoop Distributed File System. HiveQL and Pig Latin sit at the top as high-level languages. The MapReduce layer performs batch analytics on partitioned data, sorting and outputting results based on key values. The get/put layer is accessed through HBase keys by a client application for analytics, while at the bottom the distributed file system makes files appear as large, contiguous sequences of bytes. Hadoop is open source and enables access to external data, with support for incremental forward recovery. It can schedule large jobs, automatically breaking data down into manageable chunks, in addition to providing replication and machine failover. The drawbacks are that layering record abstractions on a sequenced byte-stream file abstraction can lead to broken records, and that files are split so parallel data runs through a unary MapReduce operator. In reality Hadoop is a collection of projects such as Pig, Hive, Jaql, HBase and others, which means it is not yet perfect at fluidly handling big data for the present and the future, and it is not well suited to real-time changes. As of today, however, it is the only versatile tool that can handle thousands of terabytes of data and scale to installations of 4,000 nodes. Being open source, it continues to be refined by hundreds of dedicated programmers.
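The key-based batch pattern the MapReduce layer applies can be imitated in plain Python. This is not Hadoop itself, just a single-process sketch of the map, sort/shuffle and reduce phases described above, run over a made-up text partition:

```python
# Single-process sketch of the MapReduce pattern: map emits (key, value)
# pairs, shuffle groups them by key, reduce combines each group.
from collections import defaultdict

def map_phase(record):
    # Emit a (word, 1) pair for every word in a line of text
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as Hadoop's sort/shuffle step does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values into a final per-key result
    return {key: sum(values) for key, values in groups.items()}

partition = ["big data is big", "data flows fast"]
pairs = [pair for line in partition for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In a real Hadoop deployment the map and reduce tasks run on different nodes and the shuffle moves data across the network, which is where much of the framework's complexity lies.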
Even as these developments continue, opportunities in big data abound.
Opportunities in big data
Real time analytics:
Real-time analytics will help with pointed ad placements, predictions on patient outcomes in healthcare, analysis of market trends, fraud detection and seasonally varying product recommendations. The ability of big data tools to sift mega data flows from diverse sources with interactive queries yields on-the-spot insight into business operations that is simply impossible with conventional database approaches. Big data is an economic asset as businesses learn to exploit its huge power.
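One core idea behind such real-time analysis is aggregating only the events that fall inside a recent time window. A minimal sketch, with invented event data and an assumed 60-second window:

```python
# Count stream events that fall inside a sliding time window.
# Window size and the (timestamp, event) samples are illustrative.
WINDOW_SECONDS = 60

def window_count(events, now):
    """Count events whose timestamp falls within the last WINDOW_SECONDS."""
    return sum(1 for t, _ in events if 0 <= now - t <= WINDOW_SECONDS)

# (timestamp_in_seconds, event_name) pairs arriving from a stream
events = [(0, "view"), (30, "click"), (95, "view"), (100, "buy")]
print(window_count(events, now=100))  # 2
```

Production streaming systems keep such windows incrementally rather than rescanning all events, but the windowed aggregation itself is the same.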
There will be plenty of career opportunities in big data, from highly qualified scientists to IT support staff handling the software as well as the hardware architecture. There is an estimated shortage of 1.7 million big data professionals. The quantum of big data doubles every two years, and the need for professionals will rise in proportion.
Hardware suppliers will find that big data necessitates an entirely new architectural approach and offers plenty of opportunities to put sophisticated systems in place.
In a broader perspective, data from GPS devices, cell phones, computers and medical devices in developing countries could be finely analyzed to provide services as well as prevent crises for low income groups, helping governments perform much better and meet people’s expectations.
Retailers can take advantage of big data to plan product mixes, promotions, pricing and interactions with consumers resulting in a better consumer experience.
Banks and financial institutions can implement big data to manage liquidity risk more efficiently. Currently there is a lack of business intelligence around liquidity risk analysis, one reason why quite a few banks could not address the issue, failed to recognize the portents and eventually suffered huge losses when stress points were breached. Big data can ingest voluminous amounts of data, process it and analyze it in real time to present decision makers with a clear-cut picture of how matters stand.
The health care industry can look forward to a new era in patient care with big data implementation. Each patient's data can be tracked, analyzed and acted upon in real time on a day-to-day basis, delivering better and more timely care while reducing costs and complications. Patients' responses to drug therapy could also help pharmaceutical companies fine-tune drug research and development. In fact, big data analytics has the power to help drug companies customize drugs to each patient to ensure better and faster recovery.
Big data will facilitate better content and active media management. The web has spawned an explosion of content comprising text, audio, images and video in a never-ending stream. There are ample opportunities to streamline, categorize and organize content with big data technologies, making it easier for people to find contextual and relevant information in a matter of seconds through unified, intelligent content access systems.
Online stock trading is an area where millions of shares are traded each day, with prices fluctuating each second and hundreds of thousands of buy and sell orders. These transactions take place through human intervention and also through automated, algorithm-based trading, resulting in high-frequency transactions. A player on the stock market would find it impossible to pinpoint the maximum activity in a particular stock at a particular time and act on it; there is simply too much going on for simple software or hardware to handle. This is where big data systems show their worth, effortlessly analyzing, tracking and presenting precisely the data a user wants, helping them gain the most.
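The "maximum activity" question reduces to a per-symbol aggregation over the trade stream. A toy sketch with invented trade tuples:

```python
# Spot the most active symbol in a burst of trades.
# The (symbol, quantity) tuples are invented sample data.
from collections import Counter

trades = [
    ("AAPL", 100), ("MSFT", 50), ("AAPL", 200),
    ("GOOG", 75), ("AAPL", 150), ("MSFT", 300),
]

# Aggregate traded volume per symbol
volume = Counter()
for symbol, qty in trades:
    volume[symbol] += qty

# Interpret: which symbol saw the most activity?
busiest, total = volume.most_common(1)[0]
print(busiest, total)  # AAPL 450
```

A real trading-analytics system does this continuously over millions of trades per second, which is precisely the scale that pushes the problem into big data territory.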
Immigration and legal cases are another area where authorities find it difficult to sort and sift through enormous data. This is another promising avenue for big data to change the way such cases are processed.
City traffic is another area where big data can be used to positive effect, analyzing vehicular flow in relation to time, seasons and other parameters to help planners reduce congestion and provide alternate routes for smooth traffic flow, all in real time.