Statistics. David W. Scott
Statistics
A Concise Mathematical Introduction for Students, Scientists, and Engineers
David W. Scott
Rice University
Houston, Texas
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of David W. Scott to be identified as the author of this work has been asserted in accordance with law.
Registered Offices: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA; John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office: 9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data has been applied for
Paperback ISBN: 9781119675846
Cover Design: Wiley
Cover Image: Courtesy of David W. Scott
To my parents, John and Nancy Scott
Preface
My aim in writing this book is to provide a self‐contained, one‐semester probability and statistics introduction that covers core material without ballooning into a huge tome. Since statistics requires an understanding of distributions and relationships (for example, predicting one variable from another), some introductory knowledge of multivariate calculus and linear algebra will be assumed. Examples will use the R language, but they can easily be modified to other systems such as Matlab. Mathematica will be used for symbolic computations. JMP can be used to perform statistical tests in a unified manner.
The course divides naturally into three sections: (1) classical probability; (2) distribution functions, density functions, and random variables; and (3) statistical inference and hypothesis testing.
In selecting material to include, I have favored models that follow directly from simple, intuitive assumptions. I have also favored statistical topics that are widely used. In this era of data science, I have occasionally selected new topics that are relevant and easily understood. For example, robustness is relevant because bad data or outliers can adversely affect classical methodology.
Students who have taken AP Statistics will have an advantage in that they will have seen a large number of cookbook statistical procedures and tests. We will cover only a selection, as the mathematical foundations (or outline thereof) will be of equal interest here. Often we will sacrifice mathematical rigor in favor of an engineering‐level understanding without apology. Motivated students will naturally follow this course with more mathematically rigorous courses in statistics, probability, and stochastic processes. Reading about other statistical tests and methods should be straightforward after mastering the material covered here.
To keep things simple, I have included only a handful of problems and case studies. There will be a live course website with numerous sample problems and exams. Instructors with special interests can easily insert their own examples and problems in appropriate sections.
The URL for the additional course material is
http://www.stat.rice.edu/~scottdw/wiley-dws-2020/
The directory contains problems, sample exams, and the pdf file all-figs.pdf, which displays all 57 figures, including 45 color diagrams. The author may be reached at [email protected]
I wish to thank James R. Thompson, who introduced me to the beauty of model building and statistical thinking. He served as one of my thesis advisers, directing me into the joys of nonparametric modeling. He was in turn highly influenced by his thesis adviser, John W. Tukey, one of the most important statisticians of the 20th century. Tukey's contributions ranged from the fast Fourier transform to the body of graphical work introduced in his monograph Exploratory Data Analysis. Their ideas appear throughout this book.
David W. Scott
Houston, Texas
September, 2019
1 Data Analysis and Understanding
The field of statistics has a rich history and has become tightly integrated into the emerging field of data science. Collaboration with computer scientists, numerical analysts, and decision makers characterizes the field. The role of statistics and statisticians is to find actionable information in a noisy collection of data. Every field of academic endeavor encounters this problem: from the electrical engineer trying to find a signal in a noisy channel to the English professor trying to determine the authorship of a contested, newly discovered manuscript.
There are two basic tasks for the statistician. First is to characterize the distribution of possible outcomes using a batch of representative data. An actuary may be asked to find a dollar loss for car accidents that is not exceeded 99.999% of the time. An economist may be asked to provide useful summaries of a collection of income data. The histogram is our primary tool here, an idea that did not appear until the 17th century; see Graunt (1662), who analyzed death