Data classification:
1. Social Networks (human-sourced information): this information is the record of human experiences, previously recorded in books and works of art, and later in photographs, audio and video. Human-sourced information is now almost entirely digitized and stored everywhere from personal computers to social networks. Data are loosely structured and often ungoverned.
1100. Social Networks: Facebook, Twitter, Tumblr etc.
1200. Blogs and comments
1300. Personal documents
1400. Pictures: Instagram, Flickr, Picasa etc.
1500. Videos: YouTube etc.
1600. Internet searches
1700. Mobile data content: text messages
1800. User-generated maps
1900. E-Mail
2. Traditional Business Systems (process-mediated data): these processes record and monitor business events of interest, such as registering a customer, manufacturing a product, taking an order, etc. The process-mediated data thus collected is highly structured and includes transactions, reference tables and relationships, as well as the metadata that sets its context. Traditional business data makes up the vast majority of what IT manages and processes, in both operational and BI systems. It is usually structured and stored in relational database systems. (Some sources belonging to this class may fall into the category of “Administrative data”.)
21. Data produced by Public Agencies
2110. Medical records
22. Data produced by businesses
2210. Commercial transactions
2220. Banking/stock records
2230. E-commerce
3. Internet of Things (machine-generated data): derived from the phenomenal growth in the number of sensors and machines used to measure and record the events and situations in the physical world. The output of these sensors is machine-generated data and, from simple sensor records to complex computer logs, it is well structured. As sensors proliferate and data volumes grow, it is becoming an increasingly important component of the information stored and processed by many businesses. Its well-structured nature is suitable for computer processing, but its size and speed are beyond traditional approaches.
31. Data from sensors
311. Fixed sensors
3111. Home automation
3112. Weather/pollution sensors
3113. Traffic sensors/webcam
3114. Scientific sensors
Using Big Data differs from using data stored in traditional databases, and the difference depends on the nature of the data. We can characterize five types of sources:
- Sensors/meters and activity records from electronic devices: this kind of information is produced in real time. The number and periodicity of observations is variable: sometimes it depends on a lapse of time, sometimes on the occurrence of some event (for example, a car passing through the viewing angle of a camera), and sometimes on manual manipulation (which, strictly speaking, is the same as the occurrence of an event). The quality of this kind of source depends mostly on the capacity of the sensor to take accurate measurements in the way that is expected.
- Social interactions: data produced by human interactions through a network, such as the Internet; the most common is the data produced in social networks. This kind of data has both quantitative and qualitative aspects that are of interest to measure. Quantitative aspects are easier to measure than qualitative ones: the former involve counting numbers of observations grouped by geographical or temporal characteristics, while the quality of the latter relies mostly on the accuracy of the algorithms applied to extract the meaning of contents that are commonly found as unstructured text written in natural language. Examples of analyses made from this data are sentiment analysis and trending-topic analysis (a toy sketch of such measurements follows this list);
- Business transactions: data produced as a result of business activities can be recorded in structured or unstructured databases. When recorded in structured databases, the most common problems in analyzing the information and deriving statistical indicators are the large volume of information and the speed of its production: thousands of records can be produced in a second when big companies such as supermarket chains are recording their sales. This kind of data is not always produced in formats that can be stored directly in relational databases. An electronic invoice is an example of such a source: it has more or less a structure, but to put the data it contains into a relational database we need to apply some process to distribute that data across different tables (in order to normalize the data according to relational database theory), and it may not be plain text (it could be a picture, a PDF, an Excel record, etc.). One problem here is that this process takes time and, as noted above, the data may be produced too fast, so we would need different strategies to use it: processing it as it is without putting it in a relational database, discarding some observations (by what criteria?), using parallel processing, etc. The quality of information produced from business transactions is tightly related to the capacity to obtain representative observations and to process them;
- Electronic files: this refers to unstructured documents, statically or dynamically produced, which are stored or published as electronic files, such as Internet pages, videos, audio, PDF files, etc. They can contain content of special interest, but it is difficult to extract; different techniques can be used, such as text mining, pattern recognition, and so on. The quality of our measurements relies mostly on the capacity to extract and correctly interpret all the representative information from those documents;
- Broadcasts: mainly video and audio produced in real time. Obtaining statistical data from the contents of this kind of electronic data is, for now, too complex and demands large computational and communication power. Once the problems of converting “digital-analog” contents into “digital-data” contents are solved, processing it will involve complications similar to those found with social interactions.
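As a toy illustration of the social-interaction measurements mentioned above, the sketch below (plain Python; the word lists and sample posts are invented purely for illustration) counts observations as the quantitative aspect and applies a crude keyword rule to unstructured text as a stand-in for the qualitative aspect. Real sentiment analysis would rely on far more sophisticated, trained models.

```python
# Assumed, hand-made word lists; a real system would use a trained model.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "awful"}

# Invented sample posts standing in for social-network content.
posts = [
    "love this new phone, great battery",
    "awful service, I hate waiting",
    "good price overall",
]

def sentiment(text):
    # Crude rule: count positive vs. negative keywords in the text.
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print("number of observations:", len(posts))   # quantitative aspect
print([sentiment(p) for p in posts])           # qualitative aspect
```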
2. OVERVIEW OF CLASSIFICATION
Classification is one of the data mining techniques that classifies unstructured data into structured classes and groups, and it helps users with knowledge discovery and future planning [3]. Classification supports intelligent decision making. There are two phases in classification. The first is the learning phase, in which a large training data set is supplied and analyzed, and rules and patterns are created. The second phase is the evaluation or testing of the data sets, which assesses the accuracy of the classification patterns. This section briefly describes supervised classification methods such as Decision Tree and Support Vector Machine.
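As a minimal sketch of the two phases just described, the following assumes the scikit-learn library and a small built-in dataset (neither is prescribed by the text): the learning phase derives a model from a labeled training set, and the evaluation phase measures its accuracy on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A small labeled dataset stands in for the training data sets mentioned above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Phase 1: learning -- rules/patterns are derived from the labeled training set.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Phase 2: evaluation -- the learned patterns are tested on held-out data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```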
Machine learning:
Introduction
Machine learning is an important area of artificial intelligence. The objective of machine learning is to discover knowledge and make intelligent decisions. Machine learning algorithms can be categorized into supervised, unsupervised, and semi-supervised. Where big data is concerned, it is necessary to scale up machine learning algorithms (Chen and Zhang, 2014; Tarwani et al., 2015). Another categorization of machine learning, according to the output of the learning system, includes classification, regression, clustering, density estimation, etc. Machine learning approaches include decision tree learning, association rule learning, artificial neural networks, support vector machines (SVM), clustering, Bayesian networks, genetic algorithms, etc. Machine learning has been widely applied to big data, a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques.
Methods of Machine Learning and Big Data
Supervised learning can be categorized into classification and regression. When the class attribute is discrete, it is classification; when the class attribute is continuous, it is called regression. Decision tree learning, the naive Bayes classifier, and k-nearest neighbor (k-NN) are classification methods; linear regression and logistic regression are regression methods. Unsupervised learning divides instances into groups of similar objects (Zafarani et al., 2014).
Clustering can be grouped into three categories: supervised, unsupervised, and semi-supervised (Dean, 2014).
- Supervised clustering: identifies clusters that have high probability densities with respect to individual classes. It is used when there is a target variable and a training set including the variables to cluster.
- Unsupervised clustering: maximizes the intra-cluster similarity and minimizes the inter-cluster similarity given a similarity/dissimilarity measure, using a specific objective function. K-means and hierarchical clustering are the most commonly used unsupervised clustering methods in segmentation.
- Semi-supervised clustering: in addition to the similarity measure, uses other guiding/adjusting domain information to improve clustering. This domain information can be pairwise constraints between the observations or target variables for some of the observations.
Decision trees classify data based on their feature values. They are constructed recursively from training data using a top-down greedy approach in which features are sequentially selected (Zafarani et al., 2014). Decision tree classifiers organize the training data into a tree-structured plan: construction starts with the root node holding the whole data set, then splitting criteria are iteratively chosen and leaf nodes are expanded with partitioned data subsets according to those criteria. Splitting criteria are chosen based on quality measures such as information gain, which requires handling the entire data set at each expanding node. This makes it difficult for decision trees to be applied to big data applications (Lee, 2014).
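A hedged sketch of unsupervised clustering with k-means, one of the methods named above; the synthetic two-group data and the choice of two clusters are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of points stand in for "groups of similar objects".
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Unsupervised clustering: no labels, only a similarity (distance) measure.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # one centroid per discovered cluster
print(kmeans.labels_[:5])        # cluster assignments for the first few points
```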
Support vector machine (SVM) is a binary classifier that finds a linear classifier in the higher-dimensional feature space to which the original data space is mapped. SVM shows very good performance for data sets of moderate size, but it has inherent limitations for big data applications (Lee, 2014).
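The following sketch illustrates the idea of mapping data into a higher-dimensional feature space via a kernel and fitting a linear separator there; the use of scikit-learn's SVC, the RBF kernel, and the toy dataset are assumptions, not something the text specifies.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A small, non-linearly separable dataset of moderate size.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# The RBF kernel implicitly maps the data to a higher-dimensional space,
# where the SVM looks for a linear separator between the two classes.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```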
Deep machine learning has become a research frontier in artificial intelligence. It is a machine learning technique in which many layers of information processing stages are exploited in hierarchical architectures. It computes hierarchical features or representations of the observational data, where higher-level features are defined from lower-level ones. Deep learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. While deep learning can be applied to learn from labeled data, it is primarily attractive for learning from large amounts of unlabeled data, making it well suited to extracting meaningful representations or patterns from big data. Deep learning algorithms and architectures are therefore more aptly suited to such large-scale learning problems than many conventional shallow methods.
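As a rough illustration of hierarchical feature learning, the sketch below trains a small two-hidden-layer network with scikit-learn; a real deep-learning setting would use a dedicated framework and far larger data, and the layer sizes and dataset here are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Two hidden layers: lower-level features feed into higher-level ones,
# mirroring the hierarchical representations described above.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```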
Challenges of Machine Learning Applications in Big Data
General challenges for machine learning are: (i) designing scalable and flexible computational architectures for machine learning; (ii) the ability to understand the characteristics of data before applying machine learning algorithms and tools; and (iii) the ability to construct, learn, and infer with increasing sample size, dimensionality, and categories of labels (Sukumar, 2014).
There are many large-scale machine learning algorithms, but many important sub-fields of large-scale machine learning, such as large-scale recommender systems, natural language processing, association rule learning, and ensemble learning, still face scalability problems (Chen and Zhang, 2014).
The basic MapReduce framework commonly provided by first-generation “Big Data analytics” platforms like Hadoop lacks an essential feature for machine learning: MapReduce does not support iteration/recursion or the key features required to efficiently iterate “around” a MapReduce program. Programmers building machine learning models on such systems have to implement looping in ad-hoc ways outside the core MapReduce framework. This lack of support has motivated the recent development of various specialized methods and libraries to support iterative programming on large clusters. Meanwhile, recent MapReduce extensions such as HaLoop, Twister, and PrIter aim at directly addressing the lack of iteration support in MapReduce (Bu et al., 2012).
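The sketch below illustrates, in plain Python rather than any real Hadoop API, the ad-hoc "loop around MapReduce" pattern described above: a driver repeatedly submits a map/reduce pass (here, a toy one-dimensional k-means update on invented data) until the model stops changing.

```python
from collections import defaultdict

points = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]   # toy 1-D dataset (assumed)
centroids = [0.0, 1.0]                      # arbitrary starting model

def map_phase(point, centroids):
    # Emit (index of nearest centroid, point).
    idx = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return idx, point

def reduce_phase(grouped):
    # New centroid = mean of the points assigned to it.
    return [sum(vals) / len(vals) for _, vals in sorted(grouped.items())]

# This driver loop is exactly what basic MapReduce does not provide:
# it re-runs a map/reduce pass until the model converges.
for _ in range(10):
    grouped = defaultdict(list)
    for p in points:
        k, v = map_phase(p, centroids)
        grouped[k].append(v)
    new_centroids = reduce_phase(grouped)
    if new_centroids == centroids:          # convergence check in the driver
        break
    centroids = new_centroids

print(centroids)
```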
Major problems that make machine learning (ML) methods unsuitable for solving big data classification problems are: (i) an ML method trained on a particular labeled dataset may not be suitable for another dataset, i.e., the classification may not be robust over different datasets; (ii) an ML method is generally trained using a certain number of class types, so the large variety of class types found in a dynamically growing dataset will lead to inaccurate classification results; and (iii) an ML method is developed based on a single learning task, and therefore it is not suitable for today’s multiple learning tasks and knowledge transfer requirements of big data analytics (Suthaharan, 2014).
Traditional ML algorithms generally do not scale to big data. The main difficulty lies with their memory constraint: although algorithms typically assume that training data samples fit in main memory, big data does not. A common method of learning from a large dataset is data distribution: by replacing batch training on the original training dataset with separate computations on distributed subsets, one can train an alternative prediction model at some sacrifice of accuracy. Another approach is online learning, in which memory usage does not depend on dataset size. Neither online learning nor distributed learning is sufficient for learning from big data streams, for two reasons. First, the data size is too big to be handled by either approach alone: sequential online learning on big data requires too much time for training on a single machine, while distributed learning with a large number of machines reduces the efficiency gained per machine and affects overall performance. Second, combining real-time training and prediction has not been well studied; big data is used after being stored in (distributed) storage, so the learning process also tends to work in a batch manner (Hido et al., 2013).
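A minimal sketch of the online-learning alternative mentioned above, assuming scikit-learn's SGDClassifier and a simulated mini-batch stream (both assumptions for illustration); memory use stays constant because the model only ever sees one batch at a time.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])      # all class labels must be declared up front
rng = np.random.default_rng(0)

# Simulated stream: the model is updated one mini-batch at a time, so memory
# use does not depend on the total size of the data.
for _ in range(100):
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)   # synthetic labels
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.coef_)
```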
Scaling big data to a proper dimensionality is a challenge that can be encountered in machine learning algorithms, and there are further challenges of dealing with velocity, volume, and more for all types of machine learning algorithms. Since big data processing requires decomposition and parallelism, machine learning algorithms must be adapted to operate in such distributed settings.
Big Data for Business
Due Date 28/09/2020
1- Which of the following is an example of big data utilized in action today?
a) Individual, Unconnected Hospital Databases
b) Wi-Fi Networks
c) The Internet
d) Social Media
2- What reasoning was given for the following: why is the “data storage to price ratio” relevant to big data?
a) Lower prices mean larger storage becomes easier to access for everyone, creating bigger amounts of data for client-facing services to work with.
b) It isn’t, it was just an arbitrary example of big data usage.
c) Companies can’t afford to own, maintain, and spend the energy to support large data storage unless the cost is sufficiently low.
d) Larger storage means easier accessibility to big data for every user because it allows users to download in bulk.
3- What is the best description of personalized marketing enabled by big data?
a) Marketing to each customer on an individual level and suiting to their needs.
b) Being able to obtain and use customer information for groups of consumers and utilize them for marketing needs.
c) Being able to use personalized data from every single customer for personalized marketing needs.
4- Of the following, which is an example of personalized marketing related to big data?
a) News outlets gathering information from the internet in order to report it to the public.
b) A survey that asks your age and markets to you a specific brand.
c) Google ordering ads to show items based on recent and past search results.
5- What is the workflow for working with big data?
a) Extrapolation -> Understanding -> Reproducing
b) Theory -> Models -> Precise Advice
c) Big Data -> Better Models -> Higher Precision
6- Which is the most compelling reason why mobile advertising is related to big data?
a) Mobile advertising allows massive cellular/mobile texting to a wide audience, thus providing large amounts of data.
b) Since almost everyone owns a cell/mobile phone, the mobile advertising market is large and thus requires big data to contain all the information.
c) Mobile advertising benefits from data integration with location which requires big data.
d) Mobile advertising in and of itself is always associated with big data.
7- What are the three types of diverse data sources?
a) Information Networks, Map Data, and People
b) Sensor Data, Organizational Data, and Social Media
c) Machine Data, Organizational Data, and People
d) Machine Data, Map Data, and Social Media
8- What is an example of machine data?
a) Social Media
b) Weather station sensor output.
c) Sorted data from Amazon regarding customer info.
9- What is an example of organizational data?
a) Satellite Data
b) Disease data from Centre for Disease Control.
c) Social Media
10- Of the three data sources, which is the hardest to implement and streamline into a model?
a) Machine Data
b) Organizational Data
c) People
11- Where does the real value of big data often come from?
a) Having data-enabled decisions and actions from the insights of new data.
b) Size of the data.
c) Combining streams of data and analysing them for new insights.
d) Using the three major data sources: Machines, People, and Organizations.
12- What does it mean for a device to be “smart”?
a) Connect with other devices and have knowledge of the environment.
b) Must have a way to interact with the user.
c) Having a specific processing speed in order to keep up with the demands of data processing.
13- What is the purpose of retrieval and storage; pre-processing; and analysis in order to convert multiple data sources into valuable data?
a) To enable ETL methods.
b) Designed to work like the ETL process.
c) To allow scalable analytical solutions to big data.
d) Since the multi-layered process is built into the Neo4j database connection.
14- What are data silos and why are they bad?
a) Highly unstructured data. Bad because it does not provide meaningful results for organizations.
b) A giant centralized database to house all the data produced within an organization. Bad because it is hard to maintain as highly structured data.
c) Data produced from an organization that is spread out. Bad because it creates unsynchronized and invisible data.
d) A giant centralized database to house all the data production within an organization. Bad because it hinders opportunity for data generation.
15- What does the term “in situ” mean in the context of big data?
a) Accelerometers
b) In the situation
c) Bringing the computation to the location of the data.
d) The sensors used in airplanes to measure altitude.
Name: Student ID:
Answer Sheet

| Question | Answer |
| --- | --- |
| 1 | D |
| 2 | A |
| 3 | C |
| 4 | C |
| 5 | C |
| 6 | C |
| 7 | C |
| 8 | B |
| 9 | B |
| 10 | C |
| 11 | C |
| 12 | A |
| 13 | C |
| 14 | C |
| 15 | D |
- Explain and describe the three types of diverse data sources.
A data source is the original location where the data is produced and from where we can obtain the digital information. There are three different types of diverse data sources: primary, secondary, and tertiary. Primary data sources contain original, first-hand content serving its original function; they consist of fieldwork, initial studies, interviews, etc.
- Explain what the challenges to data with high valence are.
The challenges to data with high valence are described as follows:
- Complex data exploration algorithms: with the introduction of technologies such as cloud computing, the priority must be that, if there is a mistake, the harm incurred stays below a reasonable threshold, rather than the whole job having to be done again. Fault-tolerant programming is time-consuming and involves highly complex algorithms; a foolproof, one-hundred-percent stable fault-tolerant system or program is practically a far-fetched concept.
- Data heterogeneity: in the present world approximately 80% of data is unstructured. Such unstructured data is often difficult to understand and costly to work with, and it is therefore difficult to transform it into structured data. It includes almost every kind of data we generate on a daily basis, such as social media interactions, file sharing, fax transfers, emails, tweets, and many more. This is another big challenge to data with high valence.
- Data quality: storing huge amounts of data is very expensive, and there is often a disparity between corporate executives and IT experts about the amount of data the enterprise or entity holds. In this situation the most important thing to look at is the accuracy of the data. There is no point in keeping very large data sets if the data has no value and is meaningless, since we cannot draw any conclusions from such data.
- Scalability: the big data scalability problem has now carried over to cloud computing. This involves a high degree of workload sharing, which is very costly and brings numerous difficulties, such as executing multiple tasks so that the target of each workload is effectively met. It also requires dealing effectively with device faults, which are very common in large clusters.