Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

Book description

The Complete Guide to Data Science with Hadoop—For Technical Professionals, Businesspeople, and Students

Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.

The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors provide useful guidance on the important steps of data ingestion, data munging, and visualization.

Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).

This guide provides a strong technical foundation for those who want to do practical data science, and presents business-driven guidance on how to apply Hadoop and Spark to optimize the ROI of data science initiatives.

Table of contents

About This E-Book
Title Page
Copyright Page
Contents
Foreword
Preface
  1. Focus of the Book
  2. Who Should Read This Book
  3. How to Use This Book
  4. Book Conventions
  5. Accompanying Code
1. Introduction to Data Science
  1. What Is Data Science?
  2. Example: Search Advertising
  3. A Bit of Data Science History
    1. Statistics and Machine Learning
    2. Innovation from Internet Giants
    3. Data Science in the Modern Enterprise
    1. The Data Engineer
    2. The Applied Scientist
    3. Transitioning to a Data Scientist Role
    4. Soft Skills of a Data Scientist
    1. Ask the Right Question
    2. Data Acquisition
    3. Data Cleaning: Taking Care of Data Quality
    4. Explore the Data and Design Model Features
    5. Building and Tuning the Model
    6. Deploy to Production
    1. Big Data—A Driver of Change
      1. Volume: More Data Is Now Available
      2. Variety: More Data Types
      3. Velocity: Fast Data Ingest
      1. Product Recommendation
      2. Customer Churn Analysis
      3. Customer Segmentation
      4. Sales Leads Prioritization
      5. Sentiment Analysis
      6. Fraud Detection
      7. Predictive Maintenance
      8. Market Basket Analysis
      9. Predictive Medical Diagnosis
      10. Predicting Patient Re-admission
      11. Detecting Anomalous Record Access
      12. Insurance Risk Analysis
      13. Predicting Oil and Gas Well Production Levels
      1. What Is Hadoop?
        
        Distributed File System
        
        Resource Manager and Scheduler
        
        Distributed Data Processing Frameworks
        
        Apache Sqoop
        
        Apache Flume
        
        Apache Hive
        
        Apache Pig
        
        Apache Spark
        
        R
        
        Python
        
        Java Machine Learning Packages
        
        Cost Effective Storage
        
        Schema on Read
        
        Unstructured and Semi-Structured Data
        
        Multi-Language Tooling
        
        Robust Scheduling and Resource Management
        
        Levels of Distributed Systems Abstractions
        
        Scalable Creation of Models
        
        Scalable Application of Models
        
4. Getting Data into Hadoop
        
        Hadoop as a Data Lake
        
        The Hadoop Distributed File System (HDFS)
        
        Direct File Transfer to Hadoop HDFS
        
        Importing Data from Files into Hive Tables
        
        Import CSV Files into Hive Tables
        
        Import CSV Files into HIVE Using Spark
        
        Import a JSON File into HIVE Using Spark
        
        Data Import and Export with Sqoop
        
        Apache Sqoop Version Changes
        
        Using Sqoop V2: A Basic Example
        
        Using Flume: A Web Log Example Overview
        
        Why Hadoop for Data Munging?
        
        Data Quality
        
        What Is Data Quality?
        
        Dealing with Data Quality Issues
        
        Using Hadoop for Data Quality
        
        Choosing the “Right” Features
        
        Sampling: Choosing Instances
        
        Generating Features
        
        Text Features
        
        Time-Series Features
        
        Features from Complex Data Types
        
        Feature Manipulation
        
        Dimensionality Reduction
        
        Why Visualize Data?
        
        Motivating Example: Visualizing Network Throughput
        
        Visualizing the Breakthrough That Never Happened
        
        Comparison Charts
        
        Composition Charts
        
        Distribution Charts
        
        Relationship Charts
        
        R
        
        Python: Matplotlib, Seaborn, and Others
        
        SAS
        
        Matlab
        
        Julia
        
        Other Visualization Tools
        
7. Machine Learning with Hadoop
        
        Overview of Machine Learning
        
        Terminology
        
        Task Types in Machine Learning
        
        Big Data and Machine Learning
        
        Tools for Machine Learning
        
        The Future of Machine Learning and Artificial Intelligence
        
        Summary
        
        Overview of Predictive Modeling
        
        Classification Versus Regression
        
        Evaluating Predictive Models
        
        Evaluating Classifiers
        
        Evaluating Regression Models
        
        Cross Validation
        
        Model Training
        
        Batch Prediction
        
        Real-Time Prediction
        
        Tweets Dataset
        
        Data Preparation
        
        Feature Generation
        
        Building a Classifier
        
        Overview of Clustering
        
        Uses of Clustering
        
        Designing a Similarity Measure
        
        Distance Functions
        
        Similarity Functions
        
        k-means Clustering
        
        Latent Dirichlet Allocation
        
        Data Ingestion
        
        Feature Generation
        
        Running Latent Dirichlet Allocation
        
        Overview
        
        Uses of Anomaly Detection
        
        Types of Anomalies in Data
        
        Approaches to Anomaly Detection
        
        Rules-based Methods
        
        Supervised Learning Methods
        
        Unsupervised Learning Methods
        
        Semi-Supervised Learning Methods
        
        Data Ingestion
        
        Building a Classifier
        
        Evaluating Performance
        
        Natural Language Processing
        
        Historical Approaches
        
        NLP Use Cases
        
        Text Segmentation
        
        Part-of-Speech Tagging
        
        Named Entity Recognition
        
        Sentiment Analysis
        
        Topic Modeling
        
        Small-Model NLP
        
        Big-Model NLP
        
        Bag-of-Words
        
        Word2vec
        
        Stanford CoreNLP
        
        Using Spark for Sentiment Analysis
        
        Automated Data Discovery
        
        Deep Learning
        
        Summary
        
        Quick Command Dereference
        
        General User HDFS Commands
        
        List Files in HDFS
        
        Make a Directory in HDFS
        
        Copy Files to HDFS
        
        Copy Files from HDFS
        
        Copy Files within HDFS
        
        Delete a File within HDFS
        
        Delete a Directory in HDFS
        
        Get an HDFS Status Report (Administrators)
        
        Perform an FSCK on HDFS (Administrators)
        
        General Hadoop/Spark Information
        
        Hadoop/Spark Installation Recipes
        
        HDFS
        
        MapReduce
        
        Spark
        
        Essential Tools
        
        Machine Learning
        
Product information

Title: Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale

Author(s): Ofer Mendelevitch, Casey Stella, Douglas Eadline

Release date: December 2016

Publisher(s): Addison-Wesley Professional

ISBN: 9780134029733