I'm excited about Microsoft acquiring Revolution R. Microsoft has created business-friendly tools for decades, and this is a great opportunity for them to bring the power of R to everyday users. Waiting to hear more from Microsoft...
A hero's farewell

Jacques Kallis is retiring today (Dec 30, 2013) from Test cricket at the age of 38. He will continue to play One Day International and T20 cricket.

Hats off to a magnificent all-rounder and a wonderful athlete. The way he carried himself on and off the cricket field is remarkable. Kallis is a true champion and a great role model for future generations. Thank you, Kallis, for inspiring us for two decades.

Today, as I watch South Africa beat India in the Durban Test, I cannot think of a better send-off for Kallis:

1. South Africa beat India convincingly
2. Kallis scored a century in this Test
3. He completed 200 catches in Test cricket in this game

As I watch him retire, I did some analysis to compare Kallis with other greats in cricket. I've shared it in this post. This analysis was done using R, Shiny and Ruby.

Best all-rounders in Test cricket history

Over the years, Kallis gave his team a critical advantage and a wonderful balance as an all-rounder. South Africa could pick an extra batsman or a bowler depending on the opposition and the conditions.

Kallis and Sobers stand out in all-round performance in Test cricket history

In Test cricket, Kallis = Rahul Dravid + Zaheer Khan

Kallis has performed as well as a successful specialist Indian batsman and a successful specialist Indian bowler combined in Test cricket.

Jacques Kallis = Rahul Dravid + Zaheer Khan

Test cricket batting

| Player | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 4s | 6s | Ct | St |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 165 | 279 | 40 | 13174 | 224 | 55.12 | 28587 | 46.08 | 44 | 58 | 1475 | 97 | 200 | 0 |
| Rahul Dravid | 164 | 286 | 32 | 13288 | 270 | 52.31 | 31258 | 42.52 | 36 | 63 | 1654 | 21 | 210 | 0 |

Kallis = Dravid in Test Cricket (Batting)

Test cricket bowling

| Player | Mat | Inns | Balls | Runs | Wkts | BBI | BBM | Ave | Econ | SR | 4w | 5w | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 165 | 271 | 20166 | 9499 | 292 | 6/54 | 9/92 | 32.53 | 2.82 | 69.0 | 7 | 5 | 0 |
| Zaheer Khan | 89 | 160 | 17975 | 9768 | 300 | 7/87 | 10/149 | 32.56 | 3.26 | 59.9 | 15 | 10 | 1 |

Kallis ~ Zaheer Khan in Test Cricket (Bowling)

In One Day International cricket, Kallis = Sourav Ganguly + Abdul Razzaq

Kallis has performed as well as a successful specialist Indian batsman and a successful specialist Pakistani bowler combined in One Day cricket.

Jacques Kallis = Sourav Ganguly + Abdul Razzaq

ODI cricket batting

| Player | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 4s | 6s | Ct | St |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 325 | 311 | 53 | 11574 | 139 | 44.86 | 15866 | 72.94 | 17 | 86 | 911 | 137 | 129 | 0 |
| Sourav Ganguly | 311 | 300 | 23 | 11363 | 183 | 41.02 | 15416 | 73.70 | 22 | 72 | 1122 | 190 | 100 | 0 |

Kallis = Sourav Ganguly in One Day Cricket (Batting)

ODI cricket bowling

| Player | Mat | Inns | Balls | Runs | Wkts | BBI | BBM | Ave | Econ | SR | 4w | 5w | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 325 | 283 | 10750 | 8680 | 273 | 5/30 | 5/30 | 31.79 | 4.84 | 39.3 | 2 | 2 | 0 |
| Abdul Razzaq | 265 | 254 | 10941 | 8564 | 269 | 6/35 | 6/35 | 31.83 | 4.69 | 40.6 | 8 | 3 | 0 |

Kallis = Abdul Razzaq in One Day Cricket (Bowling)

So who is the best all-rounder? Kallis or Sobers?

ESPN's analysis of Kallis and Sobers

Is it Kallis or Garry Sobers? I won't get into the religious debate of declaring either of them the best all-rounder in Test cricket history. Sobers and Kallis are both great all-rounders - prolific run scorers and a threat to opposing batsmen.

I've had the privilege of watching Kallis in my lifetime and he is a great athlete - a great batsman and a great bowler rolled into one. His performance in both Test and One Day cricket has been stellar. He's clearly the best all-rounder in One Day cricket, and gives Sobers tough competition in Tests.

Kallis is the best all-rounder in One Day cricket history

Kallis batting performance in One Day cricket

I hope South Africa and international cricket find someone of Kallis' stature.

Reference: espncricinfo.com (some images and player statistics)
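For readers who want to check the "Ave" columns above, here is a minimal Python sketch of how those averages are derived from the other columns. The original analysis was done in R, Shiny and Ruby, so this is only an illustration; the function names are my own, and the inputs are copied from the Test tables.

```python
def batting_average(runs, innings, not_outs):
    """Runs scored per dismissal: runs / (innings - not outs)."""
    dismissals = innings - not_outs
    if dismissals <= 0:
        raise ValueError("batting average is undefined without a dismissal")
    return runs / dismissals


def bowling_average(runs_conceded, wickets):
    """Runs conceded per wicket taken."""
    if wickets <= 0:
        raise ValueError("bowling average is undefined without a wicket")
    return runs_conceded / wickets


# (runs, innings, not outs) from the Test batting table
test_batting = {
    "Jacques Kallis": (13174, 279, 40),
    "Rahul Dravid": (13288, 286, 32),
}

# (runs conceded, wickets) from the Test bowling table
test_bowling = {
    "Jacques Kallis": (9499, 292),
    "Zaheer Khan": (9768, 300),
}

for player, (runs, innings, not_outs) in test_batting.items():
    print(f"{player}: batting average {batting_average(runs, innings, not_outs):.2f}")

for player, (runs, wickets) in test_bowling.items():
    print(f"{player}: bowling average {bowling_average(runs, wickets):.2f}")
```

The two batting averages land within three runs of each other and the two bowling averages within 0.03, which is the basis for the "Kallis = Dravid + Zaheer Khan" comparison.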
Many of my software engineer friends ask me about learning data science. There are many articles on this subject from renowned data scientists (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my journey (as a software engineer) in learning statistics and data visualization.

I'm midway through my 5-year journey to become proficient in data science. My learning program has included self-learning (books, blogs, toy problems), projects at work, classroom training (Stanford), teaching/presentations and conferences (UseR, Strata). Here's what I've done so far, what worked and what didn't...

1. GETTING STARTED

a) Self-learning (2 - 4 months)

Explore if data science is for you

This is the key to getting started. Two years ago some of us at work formed a study group to review Stats 202 class material. This is what got me excited about data analytics and got me started. Only 2 of the 5 members of our study group chose to dive deeper into this field (data science is not for everyone).

- Learn basic statistics: Stats 202 coursework is perfect for this
- Learn a statistical tool: I spent 3 months heads-down learning R as a newbie and had the most fun doing so. Why learn R?
- Solve toy problems: Curiosity is key to data science. If you have questions about your country's economy, crime stats or sports performance, get the data and start answering them
- Learn Unix tools: I picked O'Reilly's book Data Analysis with Open Source Tools (a hands-on guide for programmers and data scientists) to read
- Learn SQL and scripting languages: I know Java, Ruby and SQL. Python is on my list

There's a lot of training material available online

- Stats 202
- Caltech Data Science course
- Coursera: Introduction to Data Science, Machine learning, Data Analysis, Computing for Data Analysis
- University of California Berkeley - Introduction to Data Science
- Knight Center for Journalism's course on Introduction to Infographics and Data Visualization
- Stats 101: Udacity (Intro to Stats), Khan academy, Carnegie Mellon's stats course
- Learn R

b) Class-room training (9 - 12 months)

If you're serious about learning, enroll in a formal program

If you're serious about picking up this skill, opt for a course. The rigor of the class ensured that I didn't slack. Stanford offers great coursework to get started. These courses are far superior to many week-long training courses I've been to...

- Data Mining and Analysis STATS202
- Linear and Nonlinear Optimization MS&E211
- Mining Massive Data Sets CS246
- Modern Applied Statistics: Learning STATS315A
- Statistical Methods in Finance STATS240P
- Modern Applied Statistics: Data Mining STATS315B

2. GETTING FOCUSED

a) Spend 100% of my time on data science

Once I was hooked on data science, spending only 20% of my time on it was not enough to build expertise. I needed to spend 100% of my time on it, so I found work problems related to data science (big data analysis, healthcare, marketing & sales and retail analytics, optimization problems).

b) Work on interesting problems

I aligned my learning goal with my passion. I found it energizing and engaging to solve interesting problems while learning new techniques. I was interested in retail, healthcare and sports (cricket) data analysis.

c) Accelerate learning

- Teach: I taught introductory R and data mining classes to colleagues and friends. This helped me reinforce my learning and get others excited about this topic. This is also a great way for me to give back to the open source community.
- Blogging is another medium to contribute and learn
- Follow the leaders in data science and network with data scientists: DJ Patil, Hilary Mason, Jeff Hammerbacher, Carla Gentry, Monica Rogati, Cathy O'Neil. These are the people I look up to. There are many others in this space; apologies to those I've missed.
- Follow interesting blogs: http://datascience101.wordpress.com, http://columbiadatascience.com/blog, http://www.r-bloggers.com, http://www.datawrangling.com, http://flowingdata.com (Quora's best data blog list)
- Attend conferences/meetups periodically: Local data science/R meetups are a good start. O'Reilly Strata is great! Given how rapidly this field is evolving, I go there at least every other year. UseR is wonderful for seeing what's happening in the world of R
- Learn Big Data techniques: MapReduce/Hadoop, cloud computing. I avoided picking any commercial vendor technology and, in retrospect, it was a good decision.

d) Learn business domains

I'm lucky to have access to internal and external experts in data science, and they've helped me understand their approach to data science problems (how they think, hypothesize and test/assess/reject solutions). I've learned from them the importance of "hypothesis-driven data analysis" rather than "blind/brute-force data analysis". This highlighted the importance of understanding the business domain really well before trying to extract meaningful insights from the data. It led me to study operations research and marketing topics, and the retail, travel & logistics (revenue management) and healthcare industries. The NY Times recently published an article highlighting the need for intuition.

3. DATA SCIENCE BOOKS I FOUND USEFUL

- Introduction to Data Mining by Tan, Steinbach and Kumar: This is the textbook used in many introductory data science courses, including Stats 202 at Stanford. Great guide to keep handy
- R in a Nutshell
- Data Analysis with Open Source Tools
- Beautiful Visualization
- See more books on data science: O'Reilly, Manning

4. WHAT DIDN'T WORK FOR ME

- Learning multiple statistical tools: A year ago, I started getting some work requests for SAS programming, so I wanted to learn it. I tried for a month or so but could not do it. The main reasons were learning inertia and my love for the statistical tool I already knew - R. I really didn't need another statistical tool. I could solve most of my data science problems with R and the other software tools I knew. So my advice is that if you already know SAS, Stata, Matlab, SPSS or Statistica very well, stick to it. However, if you're learning a new statistical tool, pick R. R is open source while most of the others are commercial software (expensive and complex).
- Auditing courses: I tried to follow self-paced coursework from Coursera and other MOOCs but it wasn't effective for me. I needed the routine and the pressure of a formal course with proper grading to get through the rigor
- Increasing academic workload: Manage work-life balance and work commitments well. Earlier this year, I tried to take multiple difficult courses at the same time and quickly realized that I wasn't enjoying them or learning as I should.
- Sticking to the course textbook only: Many of the books in these classes are too "dense" for me (a software engineer), so I used other material to understand the concepts, for example Carnegie Mellon's notes on regression

Comments, questions and suggestions are welcome!
I was at Strata New York 2012 last month. Great conference! Thanks to O'Reilly Media for assembling the industry leaders and running it well.

I understand it was too crowded for some of my out-of-town friends. Stepping out onto the streets of midtown Manhattan for a breath of fresh air and some calm wasn't an option either. Maybe O'Reilly can get a bigger space next year?

My primary interest in Big Data analysis was structured data analysis, i.e. crunching, munging (ETL) and analysis of large datasets in columns and rows.

My team regularly deals with 1-2 Terabytes (~ 1 billion rows) of structured data (e.g. sales transaction data) for marketing/retail/healthcare analytics. Like others, we're spending a lot of time on Big Data ETL processes and less on Big Data analysis. Someone at Strata New York captured this well:

80% of a Big Data development effort goes into data integration efforts and only 20% of our effort/time is spent on analysis, i.e. interesting things we want to work on

I want to flip this equation. I want to be able to spend 20% of our time/effort on Big Data ETL/integration and 80% on Big Data analysis.

At Strata, I wanted to check if vendors and open source communities had simplified the Hadoop stack for my use. There have been many improvements in the past year and the list of products I want to look at is quite long. We have more players in the Big Data space and the solution space is muddy. It is great to have more vendors, experts and communities focusing on Big Data, but the product space is crowded, fragmented and CONFUSING (just listing all the Apache products discussed at Strata needs 1 full page).

I created a list of products to try out. I wish these products were easy to evaluate (legal paperwork, infrastructure footprint, and ease of setting up and executing my use cases).

As far as an easy-to-use and powerful data science workbench is concerned, I want to use something like R or even Excel for these steps (Big Data ETL and Big Data analysis), but they are both memory constrained. So, I need other options.

Did someone say Hadoop? Yup, on it. I tried it a few years ago and we're exploring it now. Hadoop/MapReduce is THE infrastructure to power Big Data ETL and Big Data analysis.

I also believe that MapReduce and databases are complementary technologies, and the experts agree! See MapReduce and Parallel DBMSs: friends or foes?

Here's how I've framed my problem and how I'm thinking about the solution space.

Big Data ETL process

Take a big structured dataset (multiple CSV files with a total of 100M-1B rows) and create DDL/clean/transform/split/sample/separate errors in minutes.

Solution options:

- Unix scripts (shell, awk, perl). Do all of the above in one pass (i.e. read each row only once) quickly. Start with Unix parallel processes, scale to multiple machines (mapreduce-style) only if needed
- Big Data ETL tools like Kafka?
- Open source ETL tools (e.g. Talend)
- Can commercial ETL tools do this in a few minutes/hours?
- Others?

Given any structured data from a client (CSV), our Big Data ETL workbench takes the data and processes it super fast (detect data types, clean, transform e.g. change date formats to our internal standards, separate error rows, create a sample, split into multiple clean files).

Raw data files have different schemas that we auto-detect during processing (only string vs numeric types to begin with).

Big Data Analysis

Then we load this data for analysis:

- RDBMS for well-defined arithmetic/set-based analysis
- NoSQL database (Lucene/Solr with a Blacklight front-end for discovery). Blacklight project: open source discovery app built on Lucene/Solr. I'm thinking of it as a discovery app for Big Data analysis: facets on top of structured data, to slice and dice a large structured dataset. We can add visualizations later (e.g. summaries etc.). Check out this Strata session on Lucene-powered Big Data analysis, which confirmed this design hypothesis
- The clean split files can be used in any stats tool as well for statistical analysis, e.g. SAS for larger data sets, R for smaller ones (often the clean split files are small enough for R)
- New Big Data tools like Impala

We're building proofs of concept for Big Data ETL (Unix scripts) and a Blacklight discovery app on top of Lucene/Solr. I will share them when they're ready. Stay tuned.
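To make the single-pass ETL idea concrete, here is a minimal Python sketch. It is not our workbench (which is being prototyped as Unix scripts); it assumes one CSV with a header row, guesses string vs numeric columns from the first rows it peeks at, and then streams every row once into clean, error and sample outputs. All names (`process`, `infer_schema`, `peek`, `numeric_share`, `sample_rate`) and the demo data are hypothetical.

```python
import csv
import io
import itertools
import random


def is_numeric(text):
    """Return True when the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_schema(header, rows, numeric_share=0.9):
    """Label a column 'numeric' when most non-empty peeked values parse as numbers."""
    schema = []
    for index in range(len(header)):
        values = [row[index] for row in rows
                  if len(row) == len(header) and row[index] != ""]
        hits = sum(is_numeric(value) for value in values)
        numeric = bool(values) and hits / len(values) >= numeric_share
        schema.append("numeric" if numeric else "string")
    return schema


def row_is_valid(row, schema):
    """A row is clean when it has every column and numeric columns hold numbers."""
    if len(row) != len(schema):
        return False
    return all(kind == "string" or value == "" or is_numeric(value)
               for kind, value in zip(schema, row))


def process(source, clean, errors, sample, peek=1000, sample_rate=0.01, seed=42):
    """Read each row once and route it to the clean, error and sample outputs."""
    rng = random.Random(seed)
    reader = csv.reader(source)
    header = next(reader)
    head = list(itertools.islice(reader, peek))  # buffered rows used for type detection
    schema = infer_schema(header, head)
    clean_w, error_w, sample_w = (csv.writer(f) for f in (clean, errors, sample))
    clean_w.writerow(header)
    sample_w.writerow(header)
    counts = {"clean": 0, "errors": 0, "sample": 0}
    for row in itertools.chain(head, reader):
        if not row_is_valid(row, schema):
            error_w.writerow(row)
            counts["errors"] += 1
            continue
        clean_w.writerow(row)
        counts["clean"] += 1
        if rng.random() < sample_rate:
            sample_w.writerow(row)
            counts["sample"] += 1
    return schema, counts


# Made-up demo data: 20 rows, one with a bad amount and one with a missing column.
lines = ["id,store,amount"] + [f"{i},NYC,{i}.50" for i in range(1, 21)]
lines[5] = "5,NYC,oops"
lines[9] = "9,DC"
demo_csv = "\n".join(lines) + "\n"

schema, counts = process(io.StringIO(demo_csv), io.StringIO(), io.StringIO(),
                         io.StringIO(), sample_rate=0.25)
print(schema, counts)
```

The detected schema could also drive DDL generation, and the same loop could split clean rows across several output files; both are left out to keep the sketch short.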
I'm training some of my colleagues on Big'ish data analysis this week. Here's how I'm running the class. Would love your ideas to make it better.

CLASS OBJECTIVES (LEARNING OUTCOMES)

After completion of the course, you will be able to:

- Understand concepts of data science, related processes, tools, techniques and the path to building expertise
- Use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip)
- Use Excel to do basic analysis and plots
- Write and understand R code (data structures, functions, packages, etc.)
- Explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to the dataset)
- Plot charts on a dataset using R

CLASS PREREQUISITES

- Good knowledge of basic statistics (min, max, avg, sd, variance, factors, quantiles/deciles, etc.)
- Familiarity with Unix OS

CLASS TOPICS

A) Intro to data science

- Explain data science and its importance. Data-driven business functions e.g. MROI, mix optimization, IPL teams / fantasy teams, predictions
- Big data
  - Definition: Data sets that no longer fit on a disk, requiring compute clusters and respective software and algorithms (map/reduce running on Hadoop).
  - Real big data problems: parallel computing, distributed computing, cloud, Hadoop, Cassandra
  - Most analysis isn't Big Data. Business apps often deal with datasets that fit in Excel/Access
- Products: desktop tools (Excel with Solver and What-If, Access), SQL, SPSS, Stata, R, SAS, programming languages (Ruby, Python, Java) and the stats libraries in these languages, BI tools, etc.

B) Steps in data science

- Acquire data: "obtaining the data"... databases, log files... exports, surveys, web scraping etc.
- Verify data
- Cleanse and transform data: outliers, missing values, dedupe, merge
- Explore data: The first step when dealing with a new data set needs to be exploratory in nature: what actually is in the data set? Summarize and visually inspect the entire data
  - What does the data look like? Summaries, cross-tabulation
  - What does knowing one thing tell me about another? Relationships between data elements
  - What the heck is going on?
- Visualize data
- Interact with data (not covered here): BI tools, custom dashboards, other tools (ggobi etc.)
- Archive data (not covered here)

C) Skills needed for data science

- Statistics: Concepts, approach, techniques
- Databasing: SQL
- Scripting language: Ruby, Python
- RegEx
- Visual design: Storytelling with charts
- File handling: Unix preferred. awk, gzip, gunzip, paste, sort etc.
- Office tools: Excel (plugins like Solver, What If)
- Statistical tools: R, SAS, SPSS, Stata, MATLAB, etc.
- BI tools: Qlikview, Tableau

D) Learning R

We will pick a tool to learn the concepts of data science. We will use R, a leading open source stats package.

- Why I started learning data science and picked R
- Curriculum for Intro to R (R has a steep learning curve. The purpose of this discussion is to get you started)

E) Where to go from here?

- Learn advanced techniques: sampling, predictions. Books, conferences
- Analyse your favorite dataset: e.g. cricket data analysis
- Compete (Kaggle)
- Learn other tools (Excel Solver, SAS etc.)

REFERENCE

Tutorials

- Stats202 class
- UCLA's mini course on R
- R intro
- R fundamentals
- R data import/export
- R-bloggers
- Web app integration
- RTips
- TBD

Books

- TBD
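As a small illustration of the "Explore data" step (summaries and cross-tabulation), here is a sketch using only the Python standard library. The class itself uses R, Excel and Unix tools, so treat this as a language-neutral picture of the idea; the `sales` rows are made-up data, not from any real dataset.

```python
import statistics
from collections import Counter

# Hypothetical rows: (region, channel, sale amount)
sales = [
    ("East", "web", 120.0),
    ("East", "store", 80.0),
    ("West", "web", 200.0),
    ("West", "store", 50.0),
    ("East", "web", 100.0),
    ("West", "web", 150.0),
]

# What does the data look like? Summary statistics for the numeric column.
amounts = [amount for _, _, amount in sales]
summary = {
    "min": min(amounts),
    "max": max(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "sd": statistics.stdev(amounts),
}

# What does knowing one thing tell me about another? Count rows per (region, channel).
crosstab = Counter((region, channel) for region, channel, _ in sales)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
for (region, channel), count in sorted(crosstab.items()):
    print(f"{region} x {channel}: {count}")
```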
Today, when I reminded a friend not to lose hope in Indian cricket (after the recent whitewashes in England and Australia), another friend commented:

Prasoon ji, this is Indian cricket... here every victory becomes old news the next day... You have to perform at your best... After all, they are getting unexpected money. They should deliver the goods as per citizens' expectations...

My reply to my friend was:

Agreed, they need to perform... And here's my explanation of our recent performance... After Aussie dominance ended a few years ago with Warne/Gilchrist/McGrath/others retiring, we've hit a period where most top Test teams are equal (Eng/Aus/SA/SL/Ind)... As a result, the #1 Test team in the last 2-3 years has been the team that played the most games at HOME. And I expect this trend to continue. Here's some proof: India became #1 by playing all the strong teams at home in the 2008-10 period... England did the same in 2010-11 and became #1... now look at what Pak is doing to Eng outside Eng (Eng are down 2-0)... Aussies won at home 4-0 against us in 2011/12 but India beat Aus 2-0 not too long ago (at HOME)...

Now, don't misunderstand me... I'm not saying India's terrible performance is okay. It is NOT. Indian fans deserve to be pissed. Recent performance is terrible. Several innings defeats. No one firing. Giving up so easily. Really bad. But I'm highlighting the fact that these years will bring joy and heartbreak for fans depending on where their teams are playing (Home games will bring JOY and Away games will BREAK THEIR HEARTS). Cricket fans, let us prepare ourselves for this. Our team's Test cricket performance will more or less depend on where they're playing (check out ICC's Future Tours Program). Test cricket rankings won't mean much! Wasim Akram feels the same way.

Here are some charts that explain my point of view...

1) See how India played more Tests at home between 2008 and 2010. In 2011/12, they've had mostly AWAY games. Similarly, in 2011 England played mostly at HOME

2) Notice how we see more RED (losses) than GREEN (wins) in the charts below, which show India's and England's performance in AWAY games

3) Notice how we see more GREEN (wins) than RED (losses) in the charts below, which show India's and England's performance in HOME games
Rahul Dravid said this at the Bradman Oration. It stuck with me...

One of the things Bradman said has stayed in my mind. That the finest of athletes had, along with skill, a few more essential qualities: to conduct their life with dignity, with integrity, with courage and modesty. All this he believed, were totally compatible with pride, ambition, determination and competitiveness.
Many of my friends and family members ask me how I continue to follow Indian cricket when we are doing so badly (a whitewash in Test cricket in England in late 2011 and now in Australia (well, almost there))...

My answer is that there are ups and downs in sports, and my love for the team and the sport isn't based on convenience. It isn't based on my team doing well all the time. Sometimes we do well (hey, we won the World Cup just 9 months ago and were the #1 Test team just a few months ago) and sometimes we don't (our pathetic Test cricket performance recently).

I am a fan of Indian cricket. I don't give up. I won't give up my faith, my love for the team, my love for the sport, my RELIGION.

I just hope that we get through this phase without much pain and get back up and running... Go India!
I'm at O'Reilly's Strata conference in NYC, where an NYU professor shared his research on consumer purchasing behavior. He asserts that 'What people "say" is different from what they actually "do"', so he recommends conducting behavioral studies by measuring what people "do", not what they "say". He used camera buying on Amazon as an example: Zoom is talked about a lot as a feature but has a lower impact on purchasing behavior, whereas Battery Life and Megapixels are talked about less but have a higher impact. Very interesting. In enterprises, we often conduct such analysis using surveys. Doing what he suggests will be more accurate but will cost more... it is much more costly to follow what people "do".
We are a global management consulting firm and are looking for data scientists to join our team in New York/Washington DC and Gurgaon/Chennai (India). There are full-time and internship (New York) opportunities. There are multiple positions, including developing complex models in healthcare, pricing and optimization. If interested, drop me an email at prasoonsharma at gmail dot com.

Position specifications

- Public contributions (projects, plugins, blogs, open source etc.) on MATLAB, SAS, R, SPSS, Stata a plus
- Significant experience in economic and/or scientific programming. Ideally, experience with popular statistical software like MATLAB, SAS, R, SPSS, Stata
- Ability to use, analyze and visualize large data sets
- Demonstrated ability for conceptual analytics including translating design considerations into programmed code
- B.A. required, advanced degree preferred (M.A. or Ph.D. in Computer Science, Applied Science, Engineering, Economics, Statistics or similar)

Position responsibilities

Running and developing complex healthcare models; e.g., behavioral simulation. Specific responsibilities include:

- Maintaining and developing model and other analytic assets
- Driving individual and team problem solving regarding model architecture and continued development of model and related analytic assets (e.g., ability to independently develop hypotheses, approaches and solutions to development objectives)
- Conducting research and analyzing existing data sources to derive solutions and analyses for model development, including complex multivariate analyses
- Overseeing and writing communication materials supporting analytic materials, including communication decks for clients and for client teams
- Rigorously reviewing and testing all programmed code to ensure accuracy/veracity of toolkit development
- Working with teams to understand, guide and refine client teams' requests
- Developing and maintaining work plans etc.
Rahul Dravid is a fantastic cricketer and a role model for the younger generation - focused, hardworking and humble.

Dravid recently became the 2nd highest scorer in Test cricket (Sachin Tendulkar is the leading scorer). His contribution to Indian cricket is enormous and his nickname, "The Wall", is a testimony to the concentration, skill and will that have given India tremendous success in Test cricket.

Dravid isn't worshipped in India like Tendulkar. He has played his cricket in Tendulkar's shadow. This isn't because his skill, will or contribution to Indian success is second to anyone's. It's just that he plays in the same era as another great cricketer - Tendulkar.

Indian Test cricket will never be the same when Dravid retires.
I want to learn the heavyweight of statistical software - SAS. It seems like the default choice for high-end statistics and I want to understand why.

I'm working in the healthcare practice in our firm and want to analyze claims and credit data (Terabytes, 50M+ records). Traditional ways (SQL) are limiting, and desktop statistical software like R and Stata isn't suitable for such large data analysis. Other contenders (Matlab) don't seem to be in the same league.

So, it's time to take a deep dive into SAS.

I'm looking for some advice to create a learning plan...

Good books

- I like learning by example and found this on Amazon - Learning SAS by Example: A Programmer's Guide by Ronald P. Cody
- I know some R, so this might be interesting - SAS and R: Data Management, Statistical Analysis, and Graphics by Ken Kleinman and Nicholas J. Horton
- Others?

Good tutorials

- I like video tutorials with examples, e.g. Statistics202
- I also prefer tutorials written from a programmer's perspective
- Anything for SAS out there?

Good blogs

- Anything like R-Bloggers out there for SAS?

Experts

- Will start exploring this. If you know of someone, please let me know.

Good training courses in New York area

- Preferably not the ones run by the company themselves. I'm looking for SAS experts who can run hands-on classes

SAS interest groups in New York area

- I learn well in a study group. Any meetups?
Over 3300 cricketers have played Test and One Day cricket. The youngest was 14-year-old Hasan Raza from Pakistan, who played 5 ODIs and 2 Test matches at that age. The oldest was 52-year-old W Rhodes from England, who played 8 Test matches at that age. But they are not the ones who've played the game for the longest time. Sachin Tendulkar has. Sachin started international cricket early and is still playing the game as a 38-year-old (not many 38-year-olds find a spot in international cricket these days). He has played a lot of Test and One Day matches all these years, and has performed well consistently. As a result, he has broken all batting records in cricket. It is unlikely that any other cricketer will break Tendulkar's batting records anytime soon. Sachin's international cricket career spans 21 years and counting. What an athlete!

1) Tendulkar entered the game at a very young age (debut at 16). Very few cricketers start their international career at that age. He is the 3rd youngest to play the game.

2) Tendulkar is now 38 years old and still going strong. I hope he plays the game for at least another couple of years. The only other cricketers who come close to his tenure are Javed Miandad from Pakistan (tenure 21 years, debut at 17 and retired at 38) and Sanath Jayasuriya (tenure 20 years, debut at 20 and retired at 40).

3) Like his peers, Tendulkar has reduced the number of games he plays. But unlike his peers, he's still performing. His peers seem to be headed towards retirement (see how Ponting's performance is dropping in Test and One Day cricket - runs scored and scoring rate).

Tendulkar is exceptional. He beats the rules (the normal distribution, for statisticians) and sits at the edge of all distributions - debut age, tenure and performance. When he decides to retire, he will be at the edge of the retirement age distribution curve as well.
England take on Sri Lanka in a Test series starting today. After watching the Cricket World Cup 2011 and witnessing India play in the quarter final, semi final and final in stadiums all over India, I had little appetite for any of the IPL 2011 matches. But now, I'm excited again. This Test series will be a thrilling contest considering the two teams are more or less equal:

- England are at #3 and Sri Lanka are at #4 in the ICC Test rankings
- England play well against Sri Lanka at home, and Sri Lanka have an edge in their home conditions. And the overall record is well balanced

Sri Lanka have won 2 of the last 3 Test series against England, with 1 drawn. Clearly, they have a better track record. However, it is difficult to predict the winner as England have the home advantage and the momentum (Ashes tour and other Tests in the last 18 months). Sri Lanka have a new captain in Dilshan, who will be under pressure. England will be under pressure too, given the recent emergence of English cricket and the desire to prove their Test cricket status. Let the games begin!
10. Can't crack that hard Sudoku problem?? Use R!

9. Want to pick a skill that will give you an early adopter advantage?? Learn R! It is the leading open source statistical and data analysis programming language, and is heating up!

8. Need to run statistical calculations in your software application?? Deploy R! It integrates with many programming languages like Java, Ruby, C++, Python

7. Looking for reusable libraries to solve a complex problem?? Get R! It has 2000+ free libraries to use in areas of finance, natural language processing, cluster analysis, optimization, prediction, high performance computing etc.

6. No Windows, No Doors - R runs on all the platforms. Just name it and you got it!! Windows PC, Mac, Linux to name a few

5. Did you know how much fun stats can be? Try R!!

4. Are you updated with the current trends?? Leading firms like NY Times, Google, Facebook, Bank of America, Pfizer, Merck are all using R, where are you??

3. Need to run your own analysis?? Need to solve an optimization problem?? Struggling with Excel or SQL in your model??..... just a few statements away - Try R!!

2. Want to create a compelling chart?? Try R!

1. Want the coolest job in 2014?? Learn Statistics. It is the future. Data Scientists will be the sexy job in 2018.
This is a follow-up to a previous question on the vehicle routing problem (VRP). I investigated R libraries and several other options for solving VRP and decided to build a custom desktop application using open source libraries from COIN-OR. Screenshots attached below.

Leave a comment if you're interested. I will contact you directly.

Team: Prasoon, Khaled, James
We've introduced R in the organization!

It is running alongside the heavyweights of statistical analysis like SAS, SPSS and Matlab. Here's what we did and how we did it...

HOW DID IT START?

I started learning R last year and loved its simplicity and power. After using it primarily for personal projects, I came across a business problem for which R looked like a good fit.

BUSINESS PROBLEM

The business need was to build a web-based tool for marketing budget optimization - Marketing RoI (Return on Investment), i.e. how should a company that has multiple advertisement channels allocate its marketing budget across those channels to maximize profit, customer loyalty or customer lifetime value (LTV)?

1) Input: The input to the analysis is the company's historical marketing budget allocation, profit, customer loyalty and LTV.

2) Analysis: This analysis is done in 2 steps.
- Step 1) Our experts create a formula that relates the given inputs to RoI, LTV etc. It involves econometric techniques etc.
- Step 2) Optimization of the formula when the user conducts what-if analysis by varying the total budget and/or the spend across individual channels to see its effect on RoI and LTV. The desktop optimization model is written in Excel using a commercial Excel plugin.

3) Output: Optimized spend across advertising channels and the ability to evaluate multiple scenarios to determine the optimum marketing mix

The initial version of the tool runs as an Excel model using a commercial Excel plugin. The business objective was to transform this Excel-based single-user application into a multi-user web-based application.

TECHNICAL SOLUTION

A) Web application: The web forms needed to allow users to input data and run scenarios were simple. We develop web applications using Ruby on Rails on LAMP internally. Ruby on Rails gives us an agile environment to develop software by taking care of routine web application tasks like database connectivity.
B) Optimization: Since the Excel model uses a commercial plugin for step 2, the stakeholders started with the hypothesis of using the server version of the same commercial plugin for optimization in the web application too. For this we had to prove a couple of things:
1) Optimization of the formula from step 1
2) Integration with the web application

Option 1: Commercial optimization engine
We did a quick spike to test optimization with the server version of the commercial plugin and its integration with our Ruby on Rails web application, and it was successful. We had to use JRuby to integrate Ruby with the plugin's server edition as it provides only Java and .NET APIs.

Option 2: R (Open source)
In parallel, we checked whether R could be used. R is a leading open source statistical environment.
- To solve the optimization problem in R, we found a lot of R optimization packages and started testing packages like BB, as the formula (from step 1) was non-linear and had constraints and conditions. We tested BB's SPG function and also tried other generic algorithms. We got good optimization results from R (similar to or better than the commercial optimization engine).
- Next we had to check how to integrate R with our web application written in Ruby. We found a number of options, like integrating R with Apache (rApache) or integrating R directly with Ruby (rsruby). We decided to use rsruby.
We ran a number of proofs of concept with R and shared the results with stakeholders. The results were positive in terms of performance as well as the quality of the optimized results... So we got better results, and that too for free!
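To make the optimization step a bit more concrete, here is a minimal Python sketch of budget allocation under a fixed total budget. It is not our production model and it is not the BB package: it uses plain projected gradient ascent (a simpler relative of the spectral projected gradient method behind BB's SPG), and the logarithmic response curve, the function names and all the numbers are assumptions made up for illustration.

```python
import math


def project_to_budget(x, budget):
    """Euclidean projection onto {x >= 0, sum(x) == budget}."""
    ordered = sorted(x, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        candidate = (cumulative - budget) / k
        if value - candidate > 0:
            theta = candidate
    return [max(v - theta, 0.0) for v in x]


def profit(x, scale, saturation):
    """Hypothetical diminishing-returns response: sum of a * log(1 + x / b)."""
    return sum(a * math.log1p(xi / b) for xi, a, b in zip(x, scale, saturation))


def gradient(x, scale, saturation):
    """Derivative of the hypothetical response with respect to each spend."""
    return [a / (b + xi) for xi, a, b in zip(x, scale, saturation)]


def optimize_spend(budget, scale, saturation, step=1.0, iterations=5000):
    """Projected gradient ascent, starting from an equal split of the budget."""
    n = len(scale)
    x = [budget / n] * n
    for _ in range(iterations):
        g = gradient(x, scale, saturation)
        moved = [xi + step * gi for xi, gi in zip(x, g)]
        x = project_to_budget(moved, budget)
    return x


# Made-up example: three channels sharing a total budget of 100.
example_scale = [60.0, 40.0, 20.0]
example_saturation = [20.0, 10.0, 5.0]
example_spend = optimize_spend(100.0, example_scale, example_saturation)
print([round(v, 2) for v in example_spend])
```

A what-if scenario then amounts to calling `optimize_spend` again with a different budget or with different response parameters and comparing the resulting profit.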
LESSONS LEARNED
Technical
- Be careful when running R in a shared environment, where it can use all your CPU and memory if it runs for long
- Don't forget to write unit tests for your R code using RUnit
- Capture exceptions from R and deal with them properly (show an appropriate message to users)
- rsruby installation documentation is good, but installation may need a few tries depending on your Linux distribution
- rsruby does not run on Windows (this wasn't a problem for us as we run our web applications on LAMP)
Process
- User acceptance testing: If you are transforming an Excel-based model into a web version, it is critical to have a fully working example of the Excel model so you can replicate it in R or another statistical package
- Overcoming the challenges of using new open source software in the enterprise: Like most enterprise IT shops, we are used to commercial software, and the idea of using open source software for serious work is limited to the most popular open source frameworks like Drupal, Ruby on Rails and Linux. We positioned R as an add-on to our LAMP environment and got a separate virtual server dedicated to it, as it is memory hungry.
I have big data problems. I need to analyze 100s of millions of rows of data and tried hard for 2 weeks to see if I can use R for this. My assessment so far from the experiments...
1) R is best for data that fits in a computer's RAM (so get more RAM if you can).
2) R can be used for datasets that don't fit into RAM using the Bigmemory and ff packages. However, this technique works well only for datasets smaller than 15 GB. This is in line with the excellent analysis done by Ryan. Another good tutorial for Bigmemory.
3) If we need to analyze datasets larger than 15 GB, then SAS, MapReduce and RDBMS :( seem like the only options, as they store data on the file system and access it as needed.
Since MapReduce implementations are clumsy and not business friendly yet, I wonder if it's time to explore commercial analytics tools like SAS for big data analytics.
Can Stata, Matlab or RevolutionR analyse datasets in the range of 50 - 100 GB effectively?
References
http://www.bytemining.com/2010/07/taking-r-to-the-limit-part-i-parallelization-in-r/
http://www.austinacl.blogspot.com (image)
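The "store data on the file system and access it as needed" idea can be sketched in a few lines of Python. This is only an illustration of the general out-of-core approach, not how Bigmemory, ff or SAS work internally; the function name, the column names and the tiny in-memory sample (standing in for a file too large for RAM) are all assumptions.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(rows, key_field, value_field):
    """Compute a per-group mean while holding only running totals in memory."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:  # rows are read one at a time, never all at once
        key = row[key_field]
        totals[key] += float(row[value_field])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical sample; in practice this would be an open file handle
# on a large CSV, which csv.DictReader also reads row by row.
sample = io.StringIO("region,sales\nnorth,10\nsouth,4\nnorth,20\nsouth,6\n")
means = streaming_group_mean(csv.DictReader(sample), "region", "sales")
print(means)
```

Memory use here grows with the number of groups rather than the number of rows, which is why this style of processing copes with data much larger than RAM. Statistics that need all the rows at once (a median, or a model fit) are harder to stream.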
India and Pakistan have an intense rivalry. They are siblings who like each other deep down but fight often.

Their love isn't apparent. You see it in their appreciation of each other's culture, entertainers, sportsmen and even politicians, e.g. Indian movies and movie stars are popular in Pakistan and Benazir was popular in India. You also see their love when Indian and Pakistani people meet in a 3rd country, where their media and politicians don't brainwash them. There they get to know each other really well. I'm from India and one of my best friends is from Pakistan.

Their fights, on the other hand, are much more evident. They have translated into wars (1965, 1971, 1999), political battles (UN, US/USSR, local politics), movies (Border) and sporting rivalries (Cricket, Hockey). We compete in sporting events with fanaticism and millions follow them closely.

One such battle took place in Cricket World Cup 2011. It started with Pakistan's inspiring performance in the preliminary rounds, followed by an easy quarter-final win that knocked out West Indies (the weakest team in the quarter finals).

India was expected to make it to the final eight and did. India had to play Australia (4-time world champions) in the quarter finals. Australian cricket isn't as strong anymore with many superstars retiring (Gilchrist, Hayden, Warne and McGrath). In a thrilling contest, India beat Australia in Motera, Ahmedabad. Indian bowlers restricted the Aussies to 260 and the batsmen scored 261.

This set up a thrilling contest between India and Pakistan for a place in the final.

I was in India during Cricket World Cup 2011 and got to witness this battle in Mohali, Chandigarh. This is my travelog for this particular contest.

60 HOURS OF INDIA-PAKISTAN WORLD CUP CRICKET THRILL
Mar 29
- Baroda to Mumbai by air (12:30pm - 2pm)
- Pick brother from office at 6pm and head to the airport
- Dinner at Mumbai airport at 7pm while watching Sri Lanka vs. 
New Zealand semi-final match
- Mumbai to Delhi by air (9:30pm - 11:30pm)
Note: There were no direct flights from Baroda to Chandigarh, so I had to go to Mumbai, and then from Mumbai to Delhi as all direct flights from Mumbai to Chandigarh were sold out.
Mar 30
- Delhi to Panipat by car (12 midnight to 2:30am)
- Dinner at dhaba (street restaurant) in Murthal - yummy food
- Slept for 2 hrs. Woke up at 5:30am. Dressed in Indian team T-shirt and shorts. Grabbed my lucky Indian flag that I bought in Motera, Ahmedabad during the India-Australia game
- Panipat to Chandigarh by car (6:30am to 10am)
- Collected tickets at 10:30am
- Drove to Mohali stadium by car
- Found entrance 1C (11am) after driving around the stadium twice (need better signs on the road for stadium gates)
- Easy check-in into stadium (20 mins in general first-level security and 2 mins in 1C gate security)
- Sat in block A Pavilion Terrace on the south side (11:30am). Good luck with seats: no direct sunlight, beautiful weather
- Watched preparations (wicket and field care, opening ceremony preparations, team preparations)
- Toss at 2pm
- India started batting at 2:30pm
- We were snacking and hydrating throughout the innings
- Our section was filled with top brass from Chandigarh, Delhi, Mumbai. Beautiful girls from Delhi, Mumbai and Chandigarh
- Celebrity sighting: Arbaaz Khan was sitting in our section. Aamir Khan, Preity Zinta, Indian PM Manmohan Singh and Indian businessmen were sitting in the stands above us. Aamir acknowledged the crowd and enjoyed the match throughout.
- Requested others with camera phones to click our pics and email them to us (eagerly awaiting those pics). I guess it's time for me to upgrade my Blackberry and get one with a good camera.
- Duel with Pakistani fans sitting above us (noisy, beautiful girl, Afridi's brother, 4 adults, 4 - 6 kids). "Jeetega bhai jeetega India jeetega" would always overshadow "Jeetega bhai jeetega Pakistan jeetega".
- Good gesture from crowd: no booing, no Pakistan hai-hai; a friendly, sporting, knowledgeable crowd
- Indian PM Manmohan Singh and Pakistani PM Gilani met the teams during the opening. Everyone clapped. No one shouted anything bad! Indians love Sachin and everything he does, so the focus was on him instead of the two prime ministers. Someone even shouted "Sachin for Prime Minister"
- India scored less than anticipated batting first (260 instead of 300+). Started well with Sehwag's blistering (though short) knock but others could not capitalize. Even Tendulkar's innings wasn't impeccable (4 dropped catches)
Innings break
- Pakistanis start well but wickets keep falling afterwards to push them towards defeat
- Pakistani tail crumbles but not before giving the crowd a few anxious moments (Misbah's 6s)
- Exciting match, tense crowd. Anyone could've won until the 90th over of the match!
- Sporting behavior on-field as well. No fights, no verbal duels. Nehra dismissed a close floored catch
- Great gestures off-field: India-Pakistan PMs; players joking, laughing, shaking hands; a Pakistan fan holding the Indian flag after Pakistan lost
- Match presentations: Sachin man of the match. Stadium acoustics suck. No one could hear the captains during the toss or the interviews in the presentation ceremony (it was bad in Motera, Ahmedabad too). Punjab Cricket Association, please fix it
- Peaceful exit at 11:30pm; decent, helpful crowd. Found car in 10 mins (quite amazing given the number of people and cars)
Mar 31
- Chandigarh to Panipat by car (12 midnight)
- Dinner at dhaba (line hotel) 2am - 2:40am. Yummy tandoori parathe (hail Punjab and Haryana for such great places to eat all along the road)
- Reached Panipat at 3:30am and crashed for 7 hours
- Panipat to Delhi by car (11:30am to 1:45pm)
- Remembered and thanked god for bringing us back safely. Driving on Delhi, Punjab and Haryana highways is no joke. 
It's like playing a video game - you play chicken all the time, you expect others to brake for you, lanes have no meaning. If you're tempted to drive yourself, I would advise against it. These roads and laws (or lack thereof) are best handled by drivers seasoned in these conditions.
- Watched highlights of the game at the airport (thx Hyundai for sponsoring big screens)
- Delhi to Mumbai by air (3pm to 5pm)
- Mumbai airport to home by car (5:30pm - 6:30pm)
Next up is watching the final between India and Sri Lanka at Wankhede stadium in Mumbai.
Preliminary rounds are over. The top 8 teams have qualified for the quarter finals. No surprises there. No one expected Australia, England, India, New Zealand, Pakistan, South Africa, Sri Lanka or West Indies to miss the cut. So what's next? A lot of action. The upcoming games are must-watch: the minnows are gone, the top teams now battle in knock-out rounds and the results could be surprising. So find a good excuse, a comfortable couch and your best buddies to watch the teams fight it out. There’s a lot of cricket played these days. This is the cup that matters.

Here are the players and teams to watch...

BATSMEN
- AB de Villiers: At peak performance since his ICC Batsman of the Year award in 2009.
- Sehwag is fired up and is dangerous when he spends more time in the middle than in the dressing room.
- Tendulkar has already scored two excellent centuries and should score his 100th ton across ODIs and Tests during this World Cup.
- Sangakkara is slow but steady. Not flamboyant but successful. Watch out for his contributions.
Aussie batsmen haven't scored big yet and their form remains a worry. Top Sri Lankan, Indian and South African batsmen have all found some form and are likely to score big in upcoming matches.
Indian and West Indian tails have been frail and have collapsed often. Get ready for a world cup that might be decided by how the tail-enders use their bats.

BOWLERS: Spinners lead the bowling chart, with Afridi leading the pack. He’s the new Kumble. The Aussies are missing their lethal bowling attack but Brett Lee is playing his last world cup like a champion.

ALL ROUNDERS: All rounder performance has been key to many world cups.
- Viv Richards in 1979 for West Indies' victory
- Mohinder Amarnath, Kapil Dev and Madan Lal in 1983 for India's victory
- Steve Waugh in 1987 for Australia's victory
- Imran Khan in 1992 for Pakistan's victory
- Jayasuriya and Aravinda de Silva in 1996 for Sri Lanka's victory
World Cups in 1999, 2003 and 2007 were decided by Australia's strong batting and bowling performance. 
Their batting (Ponting, Gilchrist, Waugh brothers, Bevan, Hayden) and bowling (McGrath, Warne, Lee) were superior to everyone else's. In this World Cup, Yuvraj has decided to prove wrong those who claimed that India was going into the World Cup without a genuine all rounder. His all round performance has been key in a couple of victories already. Kallis, the best all rounder of the current ODI era, hasn’t wowed yet. Australia is known for great all rounders, but notice the lack of all round performance from the Aussies in this world cup (no yellow in chart below).

TEAMS
It is difficult to predict the winner as the top 4 teams (Australia, India, South Africa, Sri Lanka) are more or less equal, as seen in the preliminary rounds, where no team was invincible and no team dominated completely. So get ready to cheer a surprise winner. It's not who you think it is!
This world cup will be decided by all rounders and tail-end batsmen (to bat out 50 overs). The batting and bowling strengths of the top teams are more or less equal. Some have stars like Dale Steyn or Brett Lee while others have good pairs like Zaheer Khan and Harbhajan Singh.
I'm introducing R to a few colleagues this week and want to share why learning software like R is important... Here are a few articles that explain it well... Other reasons?

Importance of data science
- A couple of years ago, Google's Chief Economist Hal Varian said that the sexy job in the next ten years will be statisticians. Read the full article (requires registration)
The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.
I think statisticians are part of it, but it's just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills - of being able to access, understand, and communicate the insights you get from data analysis - are going to be extremely important. Managers need to be able to access and understand the data themselves.
- Rise of data scientists
- Becoming a data scientist
- Essential skills for a data scientist

Where does R fit?
R provides an environment for all the tools needed for data science (see the data science process below from Benjamin Fry's thesis).
- R is ideal for small data analysis, i.e. data that fits in a computer's RAM, e.g. data < 10 GB. SQL and search techniques seem good for larger data sets that can still fit on one machine, and techniques like Hadoop are good for BIG data sets that cannot fit on one machine.
- NY Times article: R you ready for R?
- NY Times article on R
- R is becoming popular
As we dig deeper into the Stata vs. R debate, a few questions have come up.
Question 1: One of the things Stata does well is the way it constructs new variables (see example below). How do we do this in R?
We can rewrite it as-is using for loops in R, which is slow and not elegant. What's the elegant way to write this in R? I haven't used plyr yet... Time to learn it?
Link to question on StackOverflow
Recently I came across a complex model written in Access with complex SQL queries all over the place. The engineer who was maintaining it and I did some analysis and agreed that the model was using SQL in an unnatural way (things SQL isn't good at) - complex logic, formatting etc. We agreed to use SQL and a more powerful programming language to re-build the model. The engineer is familiar with Stata, so he quickly wrote Stata code. When I looked at the Stata code, it looked fairly easy to reproduce it in R. I've posted some R commands for the Stata commands I found in that code. What are the advantages of using Stata? Why shouldn't I use R for this?
I've refined the R code to pick the best fantasy soccer team by using more granular player performance data (available publicly). Here are the best overall, home and away teams. The constraints used are:
1) Number of goalkeepers = 1
2) Number of defenders = 4
3) Number of midfielders = 3
4) Number of strikers = 3
5) Total team cost = 50 GBP
6) Maximum number of players from a team = 2 (most fantasy soccer sites use this)
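For readers who want to see how constraints like these translate into code, here is a small Python sketch. It is not my R code: it brute-forces every valid line-up over a tiny made-up player pool, which only works because the pool is small (a real player list would need an integer programming solver). The player names, clubs, costs and points are all hypothetical, and the sketch treats the 50 GBP figure as an upper limit on total cost.

```python
from collections import Counter
from itertools import combinations, product

SQUAD_SHAPE = {"GK": 1, "DEF": 4, "MID": 3, "STR": 3}
BUDGET = 50.0        # assumed to be a cap on total team cost, in GBP
MAX_PER_CLUB = 2

# Hypothetical pool: (name, position, club, cost, points)
PLAYERS = [
    ("GK1", "GK", "A", 4.5, 120), ("GK2", "GK", "B", 4.0, 100),
    ("D1", "DEF", "A", 5.5, 140), ("D2", "DEF", "B", 5.0, 130),
    ("D3", "DEF", "C", 4.5, 125), ("D4", "DEF", "D", 4.0, 110),
    ("D5", "DEF", "E", 3.5, 95), ("D6", "DEF", "F", 3.5, 90),
    ("M1", "MID", "C", 7.0, 180), ("M2", "MID", "D", 6.0, 160),
    ("M3", "MID", "E", 5.0, 150), ("M4", "MID", "F", 4.5, 120),
    ("M5", "MID", "G", 4.0, 100),
    ("S1", "STR", "E", 8.0, 200), ("S2", "STR", "F", 7.0, 170),
    ("S3", "STR", "G", 5.5, 150), ("S4", "STR", "H", 4.5, 120),
    ("S5", "STR", "A", 4.0, 105),
]


def pick_best_team(players):
    """Return (team, points) for the highest-scoring team meeting all constraints."""
    by_position = {
        pos: [p for p in players if p[1] == pos] for pos in SQUAD_SHAPE
    }
    # Every way to fill each position with the required number of players.
    choices = [combinations(by_position[pos], n) for pos, n in SQUAD_SHAPE.items()]
    best_team, best_points = None, -1
    for groups in product(*choices):
        team = [p for group in groups for p in group]
        if sum(p[3] for p in team) > BUDGET:
            continue  # over budget
        if max(Counter(p[2] for p in team).values()) > MAX_PER_CLUB:
            continue  # too many players from one club
        points = sum(p[4] for p in team)
        if points > best_points:
            best_team, best_points = team, points
    return best_team, best_points


team, points = pick_best_team(PLAYERS)
print(points, [p[0] for p in team])
```

Separate "home" and "away" teams could be picked the same way by swapping the points column for home-only or away-only points.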