FØS8300 Data Science - from data to competitive advantage
Course description for academic year 2024/2025
Contents and structure
This course teaches students basic techniques for converting raw data into tidy datasets. A tidy dataset is organised so that it is easy to apply different statistical, graphical and table-generating routines to the data. Generating tidy datasets is quite demanding, given that raw data can be untidy in countless ways.
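As an illustration of what "tidy" means, the following is a minimal sketch, assuming the tidyr package is installed. The data frame `raw` and its values are made up for the example. It is untidy because the years are stored in column names rather than in a variable of their own.

```r
library(tidyr)

# Hypothetical raw data: one row per country, one column per year.
raw <- data.frame(
  country = c("Norway", "Sweden"),
  `2022` = c(10, 20),
  `2023` = c(11, 21),
  check.names = FALSE  # keep the year column names unchanged
)

# Tidy form: one row per observation, one column per variable.
tidy <- pivot_longer(
  raw,
  cols = -country,      # every column except country holds measurements
  names_to = "year",    # former column names become the year variable
  values_to = "value"   # cell contents become the value variable
)
tidy$year <- as.integer(tidy$year)

print(tidy)
```

In the tidy version each row holds one country-year observation, which is the layout that plotting and modelling routines generally expect.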
The course will use the programming language R and a set of extension packages called «tidyverse» to convert raw data into tidy datasets. A main point of the course is to learn to do this conversion reproducibly. The programming code will be written in chunks inside a Quarto document, which combines code, graphics, tables and text. When the document is rendered, the raw data are read in, converted to tidy data, analysed and reported, with all results embedded in the text of the document. Quarto documents can also be parametrised, so reports for different periods can be generated just by changing a few parameters at the start of the document.
A Quarto document contains only plain text. This facilitates the use of a version control system (VCS) to manage the creation of the document. We will use Git/GitHub as our VCS. Git is a distributed VCS, which makes it possible for a group of people to work together on a document without risking destroying each other’s work. The VCS also ensures that there are several independent copies of the document and makes it possible to go back to earlier versions of the document (earlier commits).
In more detail, the course will cover how to get raw data into R, whether by reading data files, communicating with databases or fetching data via an API from external data servers. It will also include a mini course in classic R, but our focus will be coding in the tidyverse style. The participants will be encouraged to embrace coding with so-called pipes. A native pipe operator has been part of R since version 4.1.0 (pipes were previously available only through an extension package). In reporting results, we will also mainly use routines from the tidyverse. In addition, we will cover how to use simple linear models to analyse the data.
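The pipe style and a simple linear model can be sketched as follows. This is only an illustration, not course material: it uses base R functions and the built-in `mtcars` dataset so that it runs without extra packages, and it assumes R 4.1.0 or later for the native pipe `|>`. The column name `wt_kg` is introduced here for the example. The same pipe works with tidyverse verbs such as `filter()` and `mutate()`.

```r
# Pipe: each step receives the result of the previous step as its
# first argument, so the code reads top to bottom.
heavy <- mtcars |>
  subset(wt > 3) |>                 # keep cars heavier than 3000 lbs
  transform(wt_kg = wt * 453.592)   # wt is recorded in units of 1000 lbs

# Simple linear model: fuel economy (mpg) as a function of weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)

print(head(heavy))
print(coef(fit))
```

Without the pipe, the first statement would be written as nested calls, `transform(subset(mtcars, wt > 3), wt_kg = wt * 453.592)`, which has to be read from the inside out.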
Learning Outcome
Knowledge
Upon completion, the students should have:
- knowledge of the principles governing a tidy dataset and the general strategies to get from dirty raw data to a highly structured tidy dataset
- knowledge of the advantages of reproducibility in research and the repercussions of ignoring it
- knowledge of the principles of version control systems
- knowledge of some basic principles of the R statistical programming language
- knowledge of some of the principles for informative and attractive presentation of data and results by graphics
Skills
Upon completion of the course, students should:
- be familiar with the RStudio IDE
- be able to solve simple R programming tasks
- be able to read and understand (some) R error messages
- be able to use R's integrated help system
- be able to write structured documents containing R code (Quarto documents)
- be able to generate different output formats from structured documents (HTML, Microsoft Word, PDF (via LaTeX))
- be able to write mathematical symbols and equations in Quarto documents (LaTeX math syntax)
- be able to control, through code, the visibility of programming code, plots, tables and results in documents
- be able to present the results of regression models in dynamic tables
- be able to use the "tidyverse" tools to generate tidy data from "dirty" raw data
- be able to use the concept of "pipes" to write clear and compact R code
- be able to use the R package ggplot2 to generate graphical representations of data and results
- be able to use the Git version control system
- be able to use Git together with online hosting services such as GitHub as a distributed version control system. Distributed version control systems have the potential to make the writing of multi-author papers/theses both safer and more convenient.
- be able to use digital tools to simplify citations and the construction of a list of references
General Competence
Upon completion of the course, the students will be able to use a distributed version control system, do some R programming and present results via tables and graphics.
Entry requirements
Higher Education Entrance Qualification (generell studiekompetanse)
Recommended previous knowledge
None
Teaching methods
The teaching will be a combination of lectures and more hands-on problem-solving seminars. The students will be required to write a number of short term papers as Quarto documents, where the tidying of raw data is a main topic.
Compulsory learning activities
None
Assessment
During the course, the students will build a portfolio on GitHub of short papers and other exercises. The portfolio will be graded pass or fail.
Examination support material
All materials are allowed